E-text

Last updated

e-text (from " electronic text"; sometimes written as etext) is a general term for any document that is read in digital form, and especially a document that is mainly text. For example, a computer based book of art with minimal text, or a set of photographs or scans of pages, would not usually be called an "e-text". An e-text may be a binary or a plain text file, viewed with any open source or proprietary software. An e-text may have markup or other formatting information, or not. An e-text may be an electronic edition of a work originally composed or published in other media, or may be created in electronic form originally. The term is usually synonymous with e-book.

Contents

E-text origins

E-texts, or electronic documents, have been around since long before the Internet, the Web, and specialized E-book reading hardware. Roberto Busa began developing an electronic edition of Aquinas in the 1940s, while large-scale electronic text editing, hypertext, and online reading platforms such as Augment and FRESS appeared in the 1960s. These early systems made extensive use of formatting, markup, automatic tables of contents, hyperlinks, and other information in their texts, as well as in some cases (such as FRESS) supporting not just text but also graphics. [1]

"Just plain text"

In some communities, "e-text" is used much more narrowly, to refer to electronic documents that are, so to speak, "plain vanilla ASCII". By this is meant not only that the document is a plain text file, but that it has no information beyond "the text itself"—no representation of bold or italics, paragraph, page, chapter, or footnote boundaries, etc. Michael S. Hart, [2] for example, argued that this "is the only text mode that is easy on both the eyes and the computer". Hart made the correct[ according to whom? ] point that proprietary word-processor formats made texts grossly inaccessible; but that is irrelevant to standard, open data formats. The narrow sense of "e-text" is now uncommon, because the notion of "just vanilla ASCII" (attractive at first glance), has turned out to have serious difficulties:

First, this narrow type of "e-text" is limited to the English letters. Not even Spanish ñ or the accented vowels used in many European languages cannot be represented (unless awkwardly and ambiguously as "~n" "a'"). Asian, Slavic, Greek, and other writing systems are impossible.

Second, diagrams and pictures cannot be accommodated, and many books have at least some such material; often it is essential to the book.

Third, "e-texts" in this narrow sense have no reliable way to distinguish "the text" from other things that occur in a work. For example, page numbers, page headers, and footnotes might be omitted, or might simply appear as additional lines of text, perhaps with blank lines before and after (or not). An ornate separator line might be represented instead by a line of asterisks (or not). Chapter and sections titles, likewise, are just additional lines of text: they might be detectable by capitalization if they were all caps in the original (or not). Even to discover what conventions (if any) were used, makes each book a new research or reverse-engineering project.

In consequence of this, such texts cannot be reliably re-formatted. A program cannot reliably tell where footnotes, headers or footers are, or perhaps even paragraphs, so it cannot re-arrange the text, for example to fit a narrower screen, or read it aloud for the visually impaired. Programs might apply heuristics to guess at the structure, but this can easily fail.

Fourth, and a perhaps surprisingly[ according to whom? ] important issue, a "plain-text" e-text affords no way to represent information about the work. For example, is it the first or the tenth edition? Who prepared it, and what rights do they reserve or grant to others? Is this the raw version straight off a scanner, or has it been proofread and corrected? Metadata relating to the text is sometimes included with an e-text, but there is by this definition no way to say whether or where it is preset. At best, the text of the title page might be included (or not), perhaps with centering imitated by indentation.

Fifth, texts with more complicated information cannot really be handled at all. A bilingual edition, or a critical edition with footnotes, commentary, critical apparatus, cross-references, or even the simplest tables. This leads to endless practical problems: for example, if the computer cannot reliably distinguish footnotes, it cannot find a phrase that a footnote interrupts.

Even raw scanner OCR output usually produces more information than this, such as the use of bold and italic. If this information is not kept, it is expensive and time-consuming to reconstruct it; more sophisticated information such as what edition you have, may not be recoverable at all.

If actuality, even "plain text" uses some kind of "markup"—usually control characters, spaces, tabs, and the like: Spaces between words; two returns and 5 spaces for paragraph. The main difference from more formal markup is that "plain texts" use implicit, usually undocumented conventions, which are therefore inconsistent and difficult to recognize. [3]

The narrow sense of e-text as "plain vanilla ASCII" has fallen out of favor.[ according to whom? ] Nevertheless, many such texts are freely available on the Web, perhaps as much because they are easily produced as because of any purported portability advantage. For many years Project Gutenberg strongly favored this model of text, but with time, has begun to develop and distribute more capable forms such as HTML.

See also

Related Research Articles

Hypertext Text with references (links) to other text that the reader can immediately access

Hypertext is text displayed on a computer display or other electronic devices with references (hyperlinks) to other text that the reader can immediately access. Hypertext documents are interconnected by hyperlinks, which are typically activated by a mouse click, keypress set, or screen touch. Apart from text, the term "hypertext" is also sometimes used to describe tables, images, and other presentational content formats with integrated hyperlinks. Hypertext is one of the key underlying concepts of the World Wide Web, where Web pages are often written in the Hypertext Markup Language (HTML). As implemented on the Web, hypertext enables the easy-to-use publication of information over the Internet.

Markup language Modern system for annotating a document

In computer text processing, a markup language is a system for annotating a document in a way that is visually distinguishable from the content. It is used only to format the text, so that when the document is processed for display, the markup language does not appear. The idea and terminology evolved from the "marking up" of paper manuscripts, which is traditionally written with a red pen or blue pencil on authors' manuscripts. Such "markup" typically includes both content corrections, and also typographic instructions, such as to make a heading larger or boldface.

Project Gutenberg Online digital book library

Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works, as well as to "encourage the creation and distribution of eBooks." It was founded in 1971 by American writer Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of books in the public domain. The Project tries to make these as free as possible, in long-lasting, open formats that can be used on almost any computer. As of 22 May 2021, Project Gutenberg had reached 65,405 items in its collection of free eBooks.

Plain text Term for computer data consisting only of unformatted characters of readable material

In computing, plain text is a loose term for data that represent only characters of readable material but not its graphical representation nor other objects. It may also include a limited number of "whitespace" characters that affect simple arrangement of text, such as spaces, line breaks, or tabulation characters. Plain text is different from formatted text, where style information is included; from structured text, where structural parts of the document such as paragraphs, sections, and the like are identified; and from binary files in which some portions must be interpreted as binary objects.

The Rich Text Format is a proprietary document file format with published specification developed by Microsoft Corporation from 1987 until 2008 for cross-platform document interchange with Microsoft products. Prior to 2008, Microsoft published updated specifications for RTF with major revisions of Microsoft Word and Office versions.

Text editor Computer software used to edit plain text documents

A text editor is a type of computer program that edits plain text. Such programs are sometimes known as "notepad" software, following the naming of Microsoft Notepad. Text editors are provided with operating systems and software development packages, and can be used to change files such as configuration files, documentation files and programming language source code.

XML Markup language developed by the W3C for encoding of data

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

Human-readable medium representation of data or information that can be naturally read by humans

A human-readable medium or human-readable format is any encoding of data or information that can be naturally read by humans.

An underscore, also called an underline, low line, or low dash, is a line drawn under a segment of text. Underscoring/underlining is a proofreading convention that says "set this text in italic type", traditionally used on manuscript or typescript as an instruction to the printer. The underscore character, _, originally appeared on the typewriter and was primarily used to emphasise words as in the proofreader's convention. To produce an underscored word, the word was typed, the typewriter carriage was moved back to the beginning of the word, and the word was overtyped with the underscore character.

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format. A CSV file typically stores tabular data in plain text, in which case each line will have the same number of fields.

reStructuredText is a file format for textual data used primarily in the Python programming language community for technical documentation.

A lightweight markup language (LML), also termed a simple or humane markup language, is a markup language with simple, unobtrusive syntax. It is designed to be easy to write using any generic text editor and easy to read in its raw form. Lightweight markup languages are used in applications where it may be necessary to read the raw document as well as the final rendered output.

Formatted text, styled text, or rich text, as opposed to plain text, has styling information beyond the minimum of semantic elements: colours, styles, sizes, and special features in HTML.

Page layout Part of graphic design that deals in the arrangement of visual elements on a page

In graphic design, page layout is the arrangement of visual elements on a page. It generally involves organizational principles of composition to achieve specific communication objectives.

Data conversion is the conversion of computer data from one format to another. Throughout a computer environment, data is encoded in a variety of ways. For example, computer hardware is built on the basis of certain standards, which requires that data contains, for example, parity bit checks. Similarly, the operating system is predicated on certain standards for data and file handling. Furthermore, each computer program handles data in a different manner. Whenever any one of these variables is changed, data must be converted in some way before it can be used by a different computer, operating system or program. Even different versions of these elements usually involve different data structures. For example, the changing of bits from one format to another, usually for the purpose of application interoperability or of capability of using new features, is merely a data conversion. Data conversions may be as simple as the conversion of a text file from one character encoding system to another; or more complex, such as the conversion of office file formats, or the conversion of image formats and audio file formats.

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process in the context of search engines designed to find web pages on the Internet is web indexing.

The following is a comparison of e-book formats used to create and publish e-books.

A structured document is an electronic document where some method of markup is used to identify the whole and parts of the document as having various meanings beyond their formatting. For example, a structured document might identify a certain portion as a "chapter title" rather than as "Helvetica bold 24" or "indented Courier". Such portions in general are commonly called "components" or "elements" of a document.

File Retrieval and Editing System

The File Retrieval and Editing SyStem, or FRESS, was a hypertext system developed at Brown University starting in 1968 by Andries van Dam and his students, including Bob Wallace. It was the first hypertext system to run on readily available commercial hardware and OS. It is also possibly the first computer-based system to have had an "undo" feature for quickly correcting small editing or navigational mistakes.

References

  1. Reading and Writing the Electronic Book. Nicole Yankelovich, Norman Meyrowitz, and Andries van Dam. IEEE Computer 18(10), October 1985. http://dl.acm.org/citation.cfm?id=4407
  2. Michael S. Hart
  3. Coombs, James H.; Renear, Allen H.; DeRose, Steven J. (November 1987). "Markup systems and the future of scholarly text processing". Communications of the ACM . ACM. 30 (11): 933–947. doi:10.1145/32206.32209. S2CID   59941802.