Text Encoding Initiative

Last updated December 11, 2024

The Text Encoding Initiative (TEI) is a text-centric community of practice in the academic field of digital humanities, operating continuously since the 1980s. The community currently runs a mailing list, meetings and conference series, and maintains the TEI technical standard, a journal,^[1] a wiki, a GitHub repository and a toolchain.

TEI guidelines

The TEI Guidelines collectively define a type of XML format, and are the defining output of the community of practice. The format differs from other well-known open formats for text (such as HTML and OpenDocument) in that it is primarily semantic rather than presentational: the semantics and interpretation of every tag and attribute are specified. There are some 500 different textual components and concepts: word,^[2] sentence,^[3] character,^[4] glyph,^[5] person,^[6] etc. Each is grounded in one or more academic disciplines and examples are given.

Technical details

The standard is split into two parts: a discursive textual description with extended examples and discussion, and a set of tag-by-tag definitions. Schemata in most of the modern formats (DTD, RELAX NG and W3C XML Schema) are generated automatically from the tag-by-tag definitions. A number of tools support the production of the Guidelines and their application to specific projects.

A number of special tags are used to circumvent restrictions imposed by the underlying Unicode: glyph allows the representation of characters that do not qualify for Unicode inclusion,^[2] and choice overcomes the required strict linearity.^[7]

Most users of the format do not use the complete range of tags, but produce a customization using a project-specific subset of the tags and attributes defined by the Guidelines. The TEI defines a sophisticated customization mechanism known as ODD for this purpose. In addition to documenting and describing each TEI tag, an ODD specification specifies its content model and other usage constraints, which may be expressed using Schematron.

TEI Lite is an example of such a customization. It defines an XML-based file format for exchanging texts. It is a manageable selection from the extensive set of elements available in the full TEI Guidelines.

As an XML-based format, TEI cannot directly deal with overlapping markup and non-hierarchical structures. The Guidelines suggest a variety of options for representing this sort of data.^[8]
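The limitation can be illustrated with a short Python sketch. It assumes a made-up verse fragment (not taken from the Guidelines) in which a quotation (q) starts in one line (l) and ends in the next, and it shows one workaround of the kind the Guidelines discuss: marking boundaries with empty elements instead of wrapping each line. The helper name is_well_formed is hypothetical, and the fragments omit the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample: <q> opens inside the first <l> and closes inside
# the second, so the two structures overlap and cannot form a tree.
overlapping = "<lg><l>He said <q>come</l><l>with me</q> now</l></lg>"

# Workaround sketch: keep <q> as a container and mark each line start
# with an empty <lb/> milestone instead of a wrapping <l> element.
milestones = "<lg><lb/>He said <q>come<lb/>with me</q> now</lg>"

print(is_well_formed(overlapping))  # False
print(is_well_formed(milestones))   # True
```

The milestone version keeps both structures in one document, at the cost of the line structure no longer being directly available as elements.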

Examples

The text of the TEI guidelines is rich in examples. There is also a samples page on the TEI wiki,^[9] which gives examples of real-world projects that expose their underlying TEI.

Prose tags

TEI allows texts to be marked up syntactically at any level of granularity, or mixture of granularities. For example, this paragraph (p) has been marked up into sentences (s) and clauses (cl).^[10]

<s>
  <cl>It was about the beginning of September, 1664,
    <cl>that I, among the rest of my neighbours, heard in ordinary discourse
      <cl>that the plague was returned again to Holland;</cl>
    </cl>
  </cl>
  <cl>for it had been very violent there, and particularly at Amsterdam and Rotterdam, in the year 1663,</cl>
  <cl>whither, <cl>they say,</cl> it was brought, <cl>some said</cl> from Italy, others from the Levant, among some goods
    <cl>which were brought home by their Turkey fleet;</cl>
  </cl>
  <cl>others said it was brought from Candia; others from Cyprus.</cl>
</s>
<s>
  <cl>It mattered not <cl>from whence it came;</cl></cl>
  <cl>but all agreed <cl>it was come into Holland again.</cl></cl>
</s>
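Markup of this kind can be queried with ordinary XML tools. The following Python sketch takes the second sentence of the example, wrapped in a p element, and counts sentences, clauses and the deepest clause nesting. It is only an illustration: the helper max_depth is hypothetical, and the fragment omits the TEI namespace that a real TEI document would declare.

```python
import xml.etree.ElementTree as ET

# Second sentence of the example above, wrapped in <p>.
sample = (
    "<p><s><cl>It mattered not <cl>from whence it came;</cl></cl>"
    "<cl>but all agreed <cl>it was come into Holland again.</cl></cl></s></p>"
)


def max_depth(elem, tag):
    """Largest number of `tag` elements on any path from `elem` downwards."""
    here = 1 if elem.tag == tag else 0
    return here + max((max_depth(child, tag) for child in elem), default=0)


root = ET.fromstring(sample)
sentences = root.findall(".//s")
clauses = root.findall(".//cl")
depth = max_depth(root, "cl")

print(len(sentences), len(clauses), depth)  # 1 4 2
```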

Verse

TEI has tags for marking up verse. This example (taken from the French translation of the TEI Guidelines) shows a sonnet.^[11]

<div type="sonnet">
  <lg type="quatrain">
    <l>Les amoureux fervents et les savants austères</l>
    <l>Aiment également, dans leur mûre saison,</l>
    <l>Les chats puissants et doux, orgueil de la maison,</l>
    <l>Qui comme eux sont frileux et comme eux sédentaires.</l>
  </lg>
  <lg type="quatrain">
    <l>Amis de la science et de la volupté</l>
    <l>Ils cherchent le silence et l'horreur des ténèbres;</l>
    <l>L'Érèbe les eût pris pour ses coursiers funèbres,</l>
    <l>S'ils pouvaient au servage incliner leur fierté.</l>
  </lg>
  <lg type="tercet">
    <l>Ils prennent en songeant les nobles attitudes</l>
    <l>Des grands sphinx allongés au fond des solitudes,</l>
    <l>Qui semblent s'endormir dans un rêve sans fin;</l>
  </lg>
  <lg type="tercet">
    <l>Leurs reins féconds sont pleins d'étincelles magiques,</l>
    <l>Et des parcelles d'or, ainsi qu'un sable fin,</l>
    <l>Étoilent vaguement leurs prunelles mystiques.</l>
  </lg>
</div>
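Because the line groups (lg) carry a type attribute, the stanza structure of a poem can be read straight from the markup. The Python sketch below uses the first quatrain and first tercet of the sonnet and reports the type and line count of each group. The function name stanza_shape is hypothetical, and the fragment omits the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Abridged version of the sonnet above: one quatrain and one tercet.
sample = """
<div type="sonnet">
  <lg type="quatrain">
    <l>Les amoureux fervents et les savants austères</l>
    <l>Aiment également, dans leur mûre saison,</l>
    <l>Les chats puissants et doux, orgueil de la maison,</l>
    <l>Qui comme eux sont frileux et comme eux sédentaires.</l>
  </lg>
  <lg type="tercet">
    <l>Ils prennent en songeant les nobles attitudes</l>
    <l>Des grands sphinx allongés au fond des solitudes,</l>
    <l>Qui semblent s'endormir dans un rêve sans fin;</l>
  </lg>
</div>
"""


def stanza_shape(div):
    """Return (type, number of lines) for each line group directly inside `div`."""
    return [(lg.get("type"), len(lg.findall("l"))) for lg in div.findall("lg")]


root = ET.fromstring(sample)
shape = stanza_shape(root)
print(shape)  # [('quatrain', 4), ('tercet', 3)]
```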

Choice tag

The choice tag is used to represent sections of text that might be encoded or tagged in more than one possible way. In the following example, based on one in the standard, choice is used twice, once to indicate an original and a corrected number, and once to indicate an original and regularised spelling.^[12]

<p xml:id="p23">Lastly, That, upon his solemn oath to observe all the above
  articles, the said man-mountain shall have a daily allowance of
  meat and drink sufficient for the support of
  <choice><sic>1724</sic><corr>1728</corr></choice> of our subjects,
  with free access to our royal person, and other marks of our
  <choice><orig>favour</orig><reg>favor</reg></choice>.</p>
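One consequence of encoding both alternatives is that software can derive either reading from the same source. The following Python sketch, using an abridged version of the example, flattens a paragraph to plain text while keeping either the original alternatives (sic, orig) or the edited ones (corr, reg). The function reading and the two sets are hypothetical names. The sketch handles only these four child elements and omits the TEI namespace, so it is an illustration rather than a general TEI processor.

```python
import xml.etree.ElementTree as ET

# Abridged version of the example above.
sample = (
    "<p>sufficient for the support of "
    "<choice><sic>1724</sic><corr>1728</corr></choice> of our subjects, "
    "and other marks of our "
    "<choice><orig>favour</orig><reg>favor</reg></choice>.</p>"
)

# Children of <choice> to keep for each kind of reading.
ORIGINAL = {"sic", "orig"}
EDITED = {"corr", "reg"}


def reading(elem, keep):
    """Flatten `elem` to text, keeping only the `keep` alternatives of each <choice>."""
    if elem.tag == "choice":
        # Emit only the selected alternatives; ignore whitespace between them.
        return "".join(reading(alt, keep) for alt in elem if alt.tag in keep)
    parts = [elem.text or ""]
    for child in elem:
        parts.append(reading(child, keep))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


root = ET.fromstring(sample)
original_text = reading(root, ORIGINAL)
edited_text = reading(root, EDITED)

print(original_text)
# sufficient for the support of 1724 of our subjects, and other marks of our favour.
print(edited_text)
# sufficient for the support of 1728 of our subjects, and other marks of our favor.
```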

ODD

One Document Does it all ("ODD") is a literate programming language for XML schemas.^[13]^[14]^[15]^[16]

In literate-programming style, ODD documents combine human-readable documentation and machine-readable models using the Documentation Elements module of the Text Encoding Initiative. Tools generate localised and internationalised human-readable output (HTML, EPUB, or PDF) and machine-readable output (DTD, W3C XML Schema, RELAX NG Compact Syntax, or RELAX NG XML Syntax).

The Roma web application^[17] is built around the ODD format and can use it to generate schemas in DTD, W3C XML Schema, RELAX NG Compact Syntax, or RELAX NG XML Syntax formats, as used by many XML validation tools and services.

ODD is the format used internally by the Text Encoding Initiative for the TEI technical standard.^[18] Although ODD files generally describe the difference between a customized XML format and the full TEI model, ODD can also be used to describe XML formats that are entirely separate from the TEI. One example is the W3C's Internationalization Tag Set, which uses the ODD format to generate schemas and document its vocabulary.^[19]^[20]
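As a rough illustration of what an ODD customization records, the Python sketch below embeds a minimal schema specification and lists the modules it selects and the elements it deletes. The element and attribute names (schemaSpec, moduleRef, elementSpec, mode="delete") are those of the TEI Documentation Elements module. The identifier myProject and the choice of deleted element are assumptions made for the example, the surrounding TEI document and namespace are omitted, and real schema generation is done by TEI tools such as Roma, not by code like this.

```python
import xml.etree.ElementTree as ET

# Minimal customization sketch: select four modules, then remove one element.
# "myProject" is a hypothetical identifier; the TEI namespace is omitted.
odd = """
<schemaSpec ident="myProject" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <elementSpec ident="mentioned" module="core" mode="delete"/>
</schemaSpec>
"""

spec = ET.fromstring(odd)

# Modules whose elements the customized schema draws on.
modules = [m.get("key") for m in spec.findall("moduleRef")]

# Elements the customization removes from those modules.
deleted = [
    e.get("ident")
    for e in spec.findall("elementSpec")
    if e.get("mode") == "delete"
]

print(modules)  # ['tei', 'header', 'core', 'textstructure']
print(deleted)  # ['mentioned']
```

The specification states only how the project differs from the full TEI model, which is why a short file can stand for a complete schema.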

TEI customizations

TEI customizations are specializations of the TEI XML specification for use in particular fields or by specific communities.

EpiDoc (Epigraphic Documents)
Charters Encoding Initiative^[21]
Medieval Nordic Text Archive (Menota)^[22]

Customization in the TEI is done through the ODD mechanism described above. In fact, since P5, all so-called 'TEI Conformant' uses of the TEI Guidelines have been based on a TEI customization documented in a TEI ODD file. Even when users choose one of the off-the-shelf pre-generated schemas to validate against, these have been created from freely available customization files.

Projects

The format is used by many projects worldwide. Practically all projects are associated with one or more universities. Some well-known projects that encode texts using TEI include:

TEI projects
Project	URL	Subject(s)
British National Corpus	http://www.natcorp.ox.ac.uk	100-million-word snapshot of current English-language usage
Oxford Text Archive	https://ota.bodleian.ox.ac.uk/repository/xmlui/	>1 GB of linguistic data and electronic texts in 25 languages
Perseus Project	https://www.perseus.tufts.edu/	Greek and Latin texts
EpiDoc	https://sourceforge.net/p/epidoc/wiki/Home/	Epigraphy and papyrology
Women Writers Project	https://wwp.northeastern.edu/	Early modern women writers (Margaret Cavendish, Eliza Haywood, etc.)
New Zealand Electronic Text Centre	http://www.nzetc.org/	New Zealand and Pacific Islands texts
The SWORD Project	https://www.crosswire.org/sword/	Bible software, dictionaries, Christian literature
FreeDict	https://freedict.org/	Bilingual dictionaries
Text Creation Partnership	https://textcreationpartnership.org/	Early British and American books
CELT	https://celt.ucc.ie/publishd.html	Ancient and medieval Irish manuscripts
ISTEX	https://www.istex.fr/	Archives of scientific publications
CAB	https://cab.geschkult.fu-berlin.de/	An edition of the Zoroastrian rituals of the Avesta, in the Avestan languages

History

Prior to the creation of TEI, humanities scholars had no common standards for encoding electronic texts in a manner that would serve their academic goals (Hockey 1993, p. 41). In 1987, a group of scholars representing fields in humanities, linguistics, and computing convened at Vassar College to put forth a set of guidelines known as the “Poughkeepsie Principles”. These guidelines directed the development of the first TEI standard, "P1".^[23]^[24]

1987 – Work started by the Association for Computers and the Humanities,^[25] the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing on what would become the TEI.^[26] This culminated in the Closing statement of the Vassar Planning Conference.^[27]
1994 – TEI P3 released,^[28] co-edited by Lou Burnard (at Oxford University) and Michael Sperberg-McQueen (then at the University of Illinois at Chicago, later at the W3C).
1999 – TEI P3 updated.
2002 – TEI P4 released, moving from SGML to XML; adoption of Unicode, which XML parsers are required to support.^[29]
2007 – TEI P5 released, including integration with the xml:lang and xml:id attributes from the W3C^[30] (these had previously been attributes in the TEI namespace), regularization of local pointing attributes to use the hash (as used in HTML) and unification of the ptr and xptr tags. Together with many other additions, these changes make P5 more regular and bring it closer to current XML practice as promoted by the W3C and as used by other XML variants. Maintenance and feature update versions of TEI P5 have been released at least twice a year since 2007.
2011 – TEI P5 v2.0.1 released with support for genetic editing^[31] (among many other additions, the genetic-editing features allow encoding of texts without interpretation as to their specific semantics).
2017 – TEI was awarded the Antonio Zampolli Prize from the Alliance of Digital Humanities Organizations.^[32]

References

[1] "Journal of the Text Encoding Initiative". Open Edition Journals. Retrieved 29 June 2022.
[2] "TEI element w (word)". tei-c.org.
[3] "TEI element s (s-unit)". tei-c.org.
[4] "TEI element c (character)". tei-c.org.
[5] "TEI element g (character or glyph)". tei-c.org.
[6] "TEI element person (person)". tei-c.org.
[7] "Element choice". www.tei-c.org.
[8] "20 Non-hierarchical Structures - TEI P5: Guidelines for Electronic Text Encoding and Interchange". tei-c.org. 2019. Retrieved 19 March 2019.
[9] "Samples of TEI texts". wiki.tei-c.org. 2011. Retrieved 17 April 2012.
[10] "17 Simple Analytic Mechanisms - TEI P5: Guidelines for Electronic Text Encoding and Interchange". tei-c.org. 2012. Retrieved 15 April 2012.
[11] "TEI element lg (groupe de vers)". tei-c.org. 2012. Archived from the original on 6 June 2012. Retrieved 15 April 2012.
[12] "TEI element choice". tei-c.org. 2012. Retrieved 15 April 2012.
[13] Bauman, Syd; Flanders, Julia (2004). ODD customizations. Extreme Markup Languages 2004. Archived from the original on 29 March 2012. Retrieved 15 April 2012.
[14] Burnard, Lou; Rahtz, Sebastian (2004). RelaxNG with Son of ODD. Extreme Markup Languages 2004. Archived from the original on 29 March 2012. Retrieved 15 April 2012.
[15] Reiss, Kevin M. (2007). Literate Documentation for XML (PDF). Digital Humanities 2007. Urbana-Champaign, Illinois. Archived from the original (PDF) on 3 March 2016. Retrieved 15 April 2012.
[16] Burnard, Lou; Rahtz, Sebastian (June 2013). "A complete schema definition language for the Text Encoding Initiative". XML London 2013: 152–161. doi:10.14337/XMLLondon13.Rahtz01 (inactive 1 November 2024). ISBN 978-0-9926471-0-0.
[17] Roma web application.
[18] Burnard, Lou; Bauman, Syd, eds. (2007). "TEI P5: Guidelines for Electronic Text Encoding and Interchange". Charlottesville, Virginia, USA: TEI Consortium.
[19] Lieske, Christian; Sasaki, Felix, eds. (3 April 2007). "Internationalization Tag Set (ITS) Version 1.0". World Wide Web Consortium. §1.5 Development of this specification.
[20] Savourel, Yves; Kosek, Jirka; Ishida, Richard, eds. (2008). "Best Practices for XML Internationalization". W3C Working Group. §5.2 ITS and TEI.
[21] "Charters Encoding Initiative - Ludwig-Maximilians-Universität München". www.cei.lmu.de.
[22] "Medieval Nordic Text Archive (Menota)". www.menota.org.
[23] Ahronheim, J.R. (1998). "Descriptive metadata: Emerging standards". Journal of Academic Librarianship. 24 (5): 395–403. doi:10.1016/S0099-1333(98)90079-9.
[24] Cantara, L. (2005). "The text-encoding initiative: Part 1". OCLC Systems & Services. 21 (1): 36–39. doi:10.1108/10650750510578136.
[25] "The Association for Computers and the Humanities". ach.org.
[26] "Historical background", section iv.2 of TEI P5: Guidelines for Electronic Text Encoding and Interchange.
[27] "Closing statement of the Vassar Planning Conference". tei-c.org. 2009. Retrieved 15 April 2012.
[28] "TEI Guidelines". Retrieved 18 June 2010.
[29] "2". XML Basics. Retrieved 9 July 2011.
[30] "Extensible Markup Language (XML) 1.0 (Fifth Edition)". w3.org.
[31] "P5 version 2.0.1 release notes". tei-c.org. 2012. Retrieved 15 April 2012.
[32] "TEI: Text Encoding Initiative".

External links

TEI Consortium Web site with a list of TEI projects, a form for adding your project Archived 2017-03-05 at the Wayback Machine and wiki
Journal of the TEI Archived 2019-01-18 at the Wayback Machine
TEI Lite: An Introduction to Text Encoding for Interchange
TEI @ Oxford Archived 2021-04-13 at the Wayback Machine (hosted at Oxford University) with development and backup versions of much of the core content.
TEI GitHub site (hosted at GitHub) with repository and issue tracker
Larger list of TEI Projects
What is the TEI? (Introductory overview by Lou Burnard)

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

