Text Creation Partnership

The Text Creation Partnership (TCP) is a not-for-profit organization that has been based in the library of the University of Michigan since 2000. Its purpose is to produce large-scale full-text electronic resources (especially in the humanities) on behalf of both member institutions (particularly academic libraries) and scholarly publishers, under an arrangement calculated to serve the needs of both. In so doing, it aims to demonstrate the value of a business model that treats corporate and non-profit information providers as potentially amicable collaborators rather than as antagonistic vendors and customers. [1]

Projects

TCP has sponsored four text-creation projects to date. The first and the largest is "EEBO-TCP (Phase I)" (2001–2009), an effort to produce structurally marked-up full-text transcriptions of 25,000+ of the roughly 125,000 books to be found either in the Pollard and Redgrave and Wing short-title catalogues of early English printed books, or among the Thomason Tracts, that is, from among nearly all books, pamphlets, and broadsides published in English or in England before 1700. The books were selected and transcribed from the digital scans produced by ProQuest Information and Learning, and distributed by them as a web-based product under the name "Early English Books Online" (EEBO). The scans from which the texts were transcribed were themselves made from the microfilm copies made over the years by ProQuest and its antecedent companies, including the original University Microfilms, Inc. [2] EEBO-TCP Phase I concluded at the end of 2009, having transcribed about 25,300 titles, and immediately moved into EEBO-TCP Phase II (2009–), a sequel project dedicated to converting all the remaining unique English-language monographs (roughly 45,000 additional titles).

The third TCP project was Evans-TCP (2003–2007, with some ongoing work through 2010), an effort to transcribe 6,000 of the 36,000 pre-1800 titles listed in Charles Evans' American Bibliography and distributed, again as page images scanned from microfilm copies, by Readex, a division of NewsBank, Inc., under the name "Archive of Americana" ("Early American Imprints, series I: Evans, 1639–1800"). Evans-TCP has produced e-texts of nearly 5,000 books.

The final TCP project was ECCO-TCP (2005–2010, with some work ongoing), an effort to transcribe 10,000 eighteenth-century books from among the 136,000 titles available in Thomson-Gale's web-based resource, "Eighteenth-Century Collections Online" (ECCO). ECCO-TCP ran out of funding in 2010 after transcribing about 3,000 (and editing about 2,400) titles.

Project commonalities

All four TCP text projects are very similar. In each case:

  1. The TCP produces text from commercial image files that have in turn been created from microfilm copies of early books.
  2. The commercial image providers receive what is in effect a full-text index to their image product for much less than it would cost to produce themselves: value added to their product.
  3. The partner libraries actually own, rather than simply license, the resultant texts, and are free (subject to some conditions) to mount the texts themselves in whatever system they like, or use the texts internally as a tool of scholarship and teaching.
  4. The texts are created according to library-determined standards, uniform across multiple data-sets and potentially cross-searchable.
  5. Because they are created collaboratively, the texts are relatively inexpensive (on a per-book basis) and become more so with each library that joins the partnership.
  6. The texts will eventually be made freely accessible to the public at large.
  7. The selection of texts to convert, though differing from project to project, in each case follows similar principles: variety, significance, representative quality, avoidance of duplication; specific requests from faculty or scholarly initiatives at member institutions are also generally honored.
  8. TCP has hitherto been interested primarily in creating texts, not in creating a "product". Though texts from all four projects are or will be mounted on servers at the University of Michigan library, the Michigan site is not the official TCP site: any partner library with adequate resources and safeguards may do the same. EEBO-TCP texts, for example, are served by Michigan, ProQuest, the Oxford University Digital Library, and the University of Chicago.

Organization

The TCP is overseen by a Board of Directors, drawn chiefly from senior library administrators at partner institutions, representatives of the corporate partners, and the Council on Library and Information Resources (CLIR). The Board is assisted in matters of selection and scholarship by an academic advisory group that includes faculty in the fields of early modern English and American studies.

The TCP has informal ties to a number of University-based scholarly text projects, especially in helping to provide them with source texts with which to work. Institutions represented include Northwestern University, University of Oxford, Washington University in St. Louis, University of Sydney, University of Toronto, and University of Victoria. TCP has also worked with students by sponsoring an Undergraduate Essay Contest every year, convening task forces on the uses of TCP texts in pedagogy, and appealing to scholars and students for ideas on selection and use.

Text production is managed through the University of Michigan's Digital Library Production Service (DLPS), which has extensive experience in the production of SGML/XML-encoded electronic texts. DLPS is assisted by the University of Oxford's Bodleian Digital Libraries Systems & Services (BDLSS), including the late Sebastian Rahtz. Small part-time production operations have also been started within two other libraries: the Centre for Reformation and Renaissance Studies in Pratt Library (Victoria University in the University of Toronto), specializing in Latin books; and the National Library of Wales (Llyfrgell Genedlaethol Cymru) in Aberystwyth, specializing in Welsh books.

Standards

All four TCP text projects are produced in the same way and to the same standards, which are documented, at least in part, on the TCP web site. [3]

  1. Accuracy. The TCP strives to produce texts that are as accurately transcribed as possible, with a specified overall accuracy rate of 99.995% or better (i.e. one error or fewer per 20,000 characters).
  2. Keying. Given the nature of the material, the only method found to deliver such accuracy economically has been to have the books keyed by data conversion firms under contract.
  3. Quality control. Accuracy of transcription and aptness of markup are assessed in all cases by a group of library-based proofers and reviewers managed by the University of Michigan DLPS.
  4. Encoding. All resultant text files are marked up in valid SGML or XML (SGML is archived, XML is exported) conforming to a proprietary Document Type Definition (DTD) derived from the P3/P4 versions of the Text Encoding Initiative (TEI) standard.
  5. Purposeful markup. Compared to the full TEI, the TCP DTD is very simple and intended to capture only the features most useful for intelligible display, intelligent navigation, and productive searching. The TCP practice is to capture, so far as feasible, the overall hierarchical structure of each book (parts, sections, chapters, etc.); the features that tend to mark the beginnings and ends of divisions (headings, explicits, salutations, valedictions, datelines, bylines, epigraphs, etc.); the most significant elements of discourse and organization (paragraphs in prose, lines and stanzas in verse, speeches, speakers, and stage directions in drama, notes, block quotes, sequential numerations of all kinds); and only the most essential aspects of physical formatting (page breaks, lists, tables, font changes).
  6. Fidelity to the original. In each case, the text is intended to represent the book as originally printed, so far as that is possible. Printer's errors are preserved, hand-written changes are ignored, duplicate scans are omitted, out-of-order images are keyed in the intended order, and most of the unusual characters of the original are preserved.
  7. Ease of reading and searching. At the same time, though the transcriptions are carried out character-by-character, TCP, on the theory that all transcription is a kind of translation from one symbolic system to another, tends to define characters in terms more of their meaning than of their form, and to map eccentric letter-forms to meaningful modern equivalents, generally in keeping with the Unicode definition of "character."
  8. Languages. Though most of the TCP texts are in English, many are not. Books and divisions of books not in English are tagged with an appropriate language code, but are not otherwise distinguished.
  9. Omitted material. The TCP produces Latin-alphabet text. Non-textual material such as musical notation, mathematical formulae, and illustrations (except for any text they may contain) is omitted and its location marked with a special tag. Extended text in non-Latin alphabets (Greek, Hebrew, Persian, etc.) is also omitted.
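
The accuracy target in item 1 can be restated as simple arithmetic: 99.995% accuracy leaves 0.005% of characters in error, which is 1 character in 20,000. The following is a minimal sketch of how a proofing sample might be checked against that threshold. The function names and the sample figures are hypothetical and are not drawn from TCP's own tools or documentation; only the 1-in-20,000 ratio comes from the text above.

```python
# Illustrative only: helper names and sample numbers are assumptions,
# not part of any TCP workflow. The 1-in-20,000 ratio is from the standard above.
CHARS_PER_ALLOWED_ERROR = 20_000  # 99.995% accuracy = at most 1 error per 20,000 characters


def accuracy_percent(total_chars: int, errors: int) -> float:
    """Return the percentage of characters that were transcribed correctly."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return 100.0 * (total_chars - errors) / total_chars


def meets_target(total_chars: int, errors: int) -> bool:
    """Return True when the error rate is no worse than 1 per 20,000 characters.

    Integer arithmetic is used so the comparison is exact at the boundary.
    """
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return errors * CHARS_PER_ALLOWED_ERROR <= total_chars


if __name__ == "__main__":
    # Made-up sample: 500,000 keyed characters, so up to 25 errors are tolerated.
    print(meets_target(500_000, 25))  # at the threshold
    print(meets_target(500_000, 26))  # one error over the threshold
```

On these assumptions, a book of 500,000 characters could contain at most 25 transcription errors and still meet the stated rate.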

Accomplishments and prospects

As of April 2011, the TCP had created about 40,000 searchable, navigable, full-text transcriptions of early books, a database of unmatched scope, scale, and utility to students in many fields. [citation needed] Whether it will be able to go on to produce the remaining 38,000 texts included in its ambitious recent plans (for EEBO-TCP Phase II) will depend on the validity of its original vision: that libraries could and should cooperate to become producers and standard-setters rather than consumers, and that universities and commercial firms, despite their very different life-cycles, constraints, and motives, could join in durable partnerships of benefit to all parties.

As of January 1, 2015, the full text of EEBO-TCP Phase I has been released under a Creative Commons license and can be freely downloaded and distributed.

In 2014 there were 28,466 titles available via Phase II. As of July 2015, ProQuest had the exclusive right for five years to distribute the EEBO-TCP Phase II collection. In 2020 the texts were made freely available to the public. [4]

References

  1. Blumenstyk, Goldie (August 10, 2001). "A Project Seeks to Digitize Thousands of Early English Texts". Chronicle of Higher Education: A47. Retrieved 2007-01-04.
  2. Beamish, Rita (July 29, 1999). "Online Archive Will Preserve Earliest English Books". New York Times. Retrieved 2007-01-04.
  3. "Production files". Text Creation Partnership. Retrieved 2020-03-12.
  4. "Frequently asked questions". Text Creation Partnership. University of Michigan Library. Retrieved 2024-05-01.