In markup languages and the digital humanities, overlap occurs when a document has two or more structures that interact in a non-hierarchical manner. A document with overlapping markup cannot be represented as a tree. This is also known as concurrent markup. Overlap happens, for instance, in poetry, where there may be a metrical structure of feet and lines; a linguistic structure of sentences and quotations; and a physical structure of volumes and pages and editorial annotations. [1] [2]
The problem of non-hierarchical structures in documents has been recognised since 1988; resolving it against the dominant paradigm of text as a single hierarchy (an ordered hierarchy of content objects or OHCO) was initially thought to be merely a technical issue, but has, in fact, proven much more difficult. [4] In 2008, Jeni Tennison identified markup overlap as "the main remaining problem area for markup technologists". [5] Markup overlap continues to be a primary issue in the digital study of theological texts in 2019, and is a major reason for the field retaining specialised markup formats—the Open Scripture Information Standard and the Theological Markup Language —rather than the inter-operable Text Encoding Initiative-based formats common to the rest of the digital humanities. [6]
A distinction exists between schemes that allow non-contiguous overlap, and those that allow only contiguous overlap. Often, 'markup overlap' strictly means the latter. Contiguous overlap can always be represented as a linear document with milestones (typically co-indexed start- and end-markers), without the need for fragmenting a (logical) component into multiple physical ones. Non-contiguous overlap may require document fragmentation. Another distinction in overlapping markup schemes is whether elements can overlap with other elements of the same kind (self-overlap). [2]
A scheme may have a privileged hierarchy. Some XML-based schemes, for example, represent one hierarchy directly in the XML document tree, and represent other, overlapping, structures by another means; these are said to be non-privileged.
Schmidt (2012) identifies a tripartite classification of instances of overlap: 1. "Variation of content and structure", 2. "Overlay of multiple perspectives or markup sets", and 3. "Overlap of individual start and end tags within a single markup perspective"; additionally, some apparent instances of overlap are in fact schema definition problems, which can be resolved hierarchically. He contends that type 1 is best resolved by a system of multiple documents external to the markup, but types 2 and 3 require dealing with internally.
DeRose (2004 , Evaluation criteria) identifies several criteria for judging solutions to the overlap problem:
Tag soup is, strictly speaking, not overlapping markup—it is malformed HTML, which is a non-overlapping language, and may be ill-defined. Some web browsers attempted to represent overlapping start and end tags with non-hierarchical Document Object Models (DOM), but this was not standardised across all browsers and was incompatible with the innately hierarchical nature of the DOM. [7] [8] HTML5 defines how processors should deal with such mis-nested markup in the HTML syntax and turn it into a single hierarchy. [9] With XHTML and SGML-based HTML, however, mis-nested markup is a strict error and makes processing by standards-compliant systems impossible. [10] The HTML standard defines a paragraph concept which can cause overlap with other elements and can be non-contiguous. [11]
SGML, which early versions of HTML were based on, has a feature called CONCUR that allows multiple independent hierarchies to co-exist without privileging any. DTD validation is only defined for each individual hierarchy with CONCUR. Validation across hierarchies is not defined by the standard. CONCUR cannot support self-overlap, and it interacts poorly with some of SGML's abbreviatory features. This feature has been poorly supported by tools and has seen very little actual use; using CONCUR to represent document overlap was not a recommended use case, according to a commentary by the standard's editor. [12] [13]
There are several approaches to representing overlap in a non-overlapping language. [14] The Text Encoding Initiative, as an XML-based markup scheme, cannot directly represent overlapping markup. All four of the below approaches are suggested. [15] The Open Scripture Information Standard is another XML-based scheme, designed to mark up the Bible. It uses empty milestone elements to encode non-privileged components. [16]
To illustrate these approaches, marking up the sentences and lines of a fragment of Richard III by William Shakespeare will be used as a running example. Where there is a privileged hierarchy, the lines will be used.
Multiple documents can each provide different internally consistent hierarchies. The advantage of this approach is that each document is simple and can be processed with existing tools, but requires maintenance of redundant content and it can be difficult to cross-reference between different views. [17] With multiple documents, the overlap can be analysed with data comparison and delta encoding techniques, and, in an XML context, specific XML tree differencing algorithms are available. [18] [19]
Schmidt (2012 , 3.5 Variation) recommends this approach for encoding multiple variants of a single text and to accept the duplication of the parts which do not vary, rather than attempting to create a structure that represents all of the variation present; further, he suggests that this alignment be performed automatically, and that misalignment is rare in practice. [20]
Example, with lines marked up:
<line>I,byattorney,blesstheefromthymother,</line><line>WhoprayscontinuallyforRichmond'sgood.</line><line>Somuchforthat.—Thesilenthoursstealon,</line><line>Andflakydarknessbreakswithintheeast.</line>
With sentences marked up:
<sentence>I,byattorney,blesstheefromthymother, WhoprayscontinuallyforRichmond'sgood.</sentence><sentence>Somuchforthat.</sentence><sentence>—Thesilenthoursstealon, Andflakydarknessbreakswithintheeast.</sentence>
Milestones are empty elements that mark the beginning and end of a component, typically using the XML ID mechanism to indicate which "begin" element goes with which "end" element. Milestones can be used to embed a non-privileged structure within a hierarchical language, In their basic form they can only represent contiguous overlap. Generic XML can of course parse the milestone elements, but do not understand their special meaning and so cannot easily process or validate the non-privileged structure. [21] [22]
Milestone have the advantage that the markup for overlapping elements is located right at the relevant boundaries, like other markup. This is an advantage for maintainability and readability. [23] CLIX ( DeRose 2004 ) is an example of such an approach.
Example:
<line><sentence-start/>I,byattorney,blesstheefromthymother,</line><line>WhoprayscontinuallyforRichmond'sgood.<sentence-end/></line><line><sentence-start/>Somuchforthat.<sentence-end/><sentence-start/>—Thesilenthoursstealon,</line><line>Andflakydarknessbreakswithintheeast.<sentence-end/></line>
Punctuation and spaces have been identified as a type of milestone-style 'crypto-overlap' or 'pseudo-markup', as the boundaries of words, clauses, sentences and the like do not necessarily align with the formal markup boundaries hierarchically. [24] [25]
It is also possible to use more complex milestones to represent non-contiguous structures. For example, TAGML's "suspend" and "resume" semantic [26] can be expressed using milestones, for example by adding an attribute to indicate whether each milestone represents a start, suspend, resume, or end point. Re-ordering and even self-overlap can be achieved similarly, by annotating each milestone with a "next chunk" reference.
Joins are pointers within a privileged hierarchy to other components of the privileged hierarchy, which may be used to reconstruct a non-privileged component akin to following a linked list. A single non-privileged element is segmented into several partial elements within the privileged hierarchy; the partial elements themselves do not represent a single unit in the non-privileged hierarchy, which can be misleading and make processing difficult. [27] [28] While this approach can support some discontiguous structures, it is not able to re-order elements. [29] A slightly different approach can, however, express re-ordering by expressing the join away from the content, at the cost of directness and maintainability. [30]
Join-based representations can introduce the possibility of cycles between elements; detecting and rejecting these adds complexity to implementations. [31]
Example:
<line><sentenceid="a">I,byattorney,blesstheefromthymother,</sentence></line><line><sentencecontinues="a">WhoprayscontinuallyforRichmond'sgood.</sentence></line><line><sentenceid="b">Somuchforthat.</sentence><sentenceid="c">—Thesilenthoursstealon,</sentence></line><line><sentencecontinues="c">Andflakydarknessbreakswithintheeast.</sentence></line>
Stand-off markup is similar to using joins, except that there may be no privileged hierarchy: each part of the document is given a label (or might be referred to by an offset), and the document structure is expressed by pointing to the content from markup that 'stands off' from the content (possibly in an entirely different file), and might contain no content itself. The TEI guidelines identify the unity of the elements as a primary advantage of stand-off markup over joins, in addition to the ability to produce and distribute annotations separately from the text, possibly even by different authors applying markup to a read-only document, [32] allowing collaborative approaches to markup by a divide and conquer strategy. [33]
Example:
<spanid="a">I,byattorney,blesstheefromthymother,</span><spanid="b">WhoprayscontinuallyforRichmond'sgood.</span><spanid="c">Somuchforthat.</span><spanid="d">—Thesilenthoursstealon,</span><spanid="e">Andflakydarknessbreakswithintheeast.</span>... <linecontents="a"/><linecontents="b"/><linecontents="c d"/><linecontents="e"/><sentencecontents="a b"/><sentencecontents="c"/><sentencecontents="d e"/>
It has been claimed that separating markup and text can result in overall simplification and increased maintainability, [34] and by 2017, ``[t]he current state of the art to [represent] (...) linguistically annotated data is to use a graph-based representation serialized as standoff XML as a pivot format´´, [35] i.e., that standoff was the most widely accepted approach to address the overlapping markup challenge.
Standoff formalisms have been the basis for an ISO standard for linguistic annotation, [36] they have been successfully applied for developing corpus management systems, [37] and (as of April 2020) they are actively being developed in the TEI. [38] One published example of a successful stand-off annotation scheme was developed as part of a bitext natural language documentation project focused on the preservation of low-resource or endangered languages. [39]
Representing overlapping markup within hierarchical languages is challenging, for reasons of redundancy and/or complexity. In the 2000s to 2010s, standoff formalisms were generally accepted as the most promising approach here, [35] but a disadvantage of standoff is that validation is very challenging. [40] Standoff formalisms are not natively supported by database management systems, so that (by 2017) it was suggested to ``use ... standoff XML as a pivot format (...) and relational data bases for querying.´´ [35] In practical applications, this requires complicated architectures and/or labor-intense transformation between pivot format and internal representation. As a result, maintenance is problematic. [41] This has been a motivation to develop corpus management systems on the basis of graph data bases and for using established graph-based formalisms as pivot formats.
For implementing the above-mentioned strategies, either existing markup languages (such as the TEI) can be extended or special-purpose languages can be designed. To design an entirely new markup language allow to forego[ incomprehensible ] the tool support in existing languages for a less complicated semantic model and more convenient syntax.
None of these formalisms seem to be maintained anymore. Consensus community seems to be to employ standoff XML or graph-based formalisms.
Standoff approaches have two parts, commonly called the "content" and the "annotations." These can be expressed in unrelated representations. Simple standoff annotations per se, involve no more than a list of (location, type) pairs. Thus, in a few applications[ example needed ] standoff annotations are expressed in CSV, JSON(-LD, or other representations. (e.g., Web Annotation [61] ) or graph formalisms grounded in string URIs (see below). However, representing and validating content in such representations is much more difficult and much less common.
Standoff markup employs a data model based on directed graphs, [62] thus complicating its representation when grounding markup information in a tree. Representing overlapping hierarchies in a graph eliminates this challenge. Standoff annotations can thus be more adequately represented as generalised directed multigraphs and use formalisms and technologies developed for this purpose, most notably those based on the Resource Description Framework (RDF). [63] [64] EARMARK is an early RDF/OWL representation that encompasses General Ordered-Descendant Directed Acyclic Graphs (GODDAGs). [14] The theory of GODDAGs, while not strictly a markup language itself, is a general data model for non-hierarchical markup.
RDF is a semantic data model that is linearization-independent, and it provides different linearisations, including an XML format (RDF/XML) that can be modeled to mirror standoff XML, a linearisation that lets RDF be expressed in XML attributes (RDFa), a JSON format (JSON-LD), and binary formats designed to facilitate querying or processing (RDF-HDT, [65] RDF-Thrift [66] ). RDF is semantically equivalent to graph-based data models underlying standoff markup; it does not require special-purpose technology for storing, parsing and querying. Multiple interlinked RDF files representing a document or a corpus constitute an example of Linguistic Linked Open Data.
An established technique to link arbitrary graphs with an annotated document is to use URI fragment identifiers to refer to parts of a text and/or document, see overview under Web annotation. The Web Annotation standard provides format-specific `selectors' as an additional means, e.g., offset-, string-match- or XPath-based selectors. [67]
Native RDF vocabularies capable to represent linguistic annotations include: [68]
Related vocabularies include
In early 2020, W3C Community Group LD4LT has launched an initiative to harmonize these vocabularies and to develop a consolidated RDF vocabulary for linguistic annotations on the web. [74]
{{cite journal}}
: Cite journal requires |journal=
(help)A markuplanguage is a text-encoding system which specifies the structure and formatting of a document and potentially the relationship between its parts. Markup can control the display of a document or enrich its content to facilitate automated processing.
In computing, plain text is a loose term for data that represent only characters of readable material but not its graphical representation nor other objects. It may also include a limited number of "whitespace" characters that affect simple arrangement of text, such as spaces, line breaks, or tabulation characters. Plain text is different from formatted text, where style information is included; from structured text, where structural parts of the document such as paragraphs, sections, and the like are identified; and from binary files in which some portions must be interpreted as binary objects.
The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":
The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.
Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.
Mathematical Markup Language (MathML) is a mathematical markup language, an application of XML for describing mathematical notations and capturing both its structure and content, and is one of a number of mathematical markup languages. Its aim is to natively integrate mathematical formulae into World Wide Web pages and other documents. It is part of HTML5 and standardised by ISO/IEC since 2015.
The Geography Markup Language (GML) is the XML grammar defined by the Open Geospatial Consortium (OGC) to express geographical features. GML serves as a modeling language for geographic systems as well as an open interchange format for geographic transactions on the Internet. Key to GML's utility is its ability to integrate all forms of geographic information, including not only conventional "vector" or discrete objects, but coverages and sensor data.
The Text Encoding Initiative (TEI) is a text-centric community of practice in the academic field of digital humanities, operating continuously since the 1980s. The community currently runs a mailing list, meetings and conference series, and maintains the TEI technical standard, a journal, a wiki, a GitHub repository and a toolchain.
RDFa or Resource Description Framework in Attributes is a W3C Recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within Web documents. The Resource Description Framework (RDF) data-model mapping enables its use for embedding RDF subject-predicate-object expressions within XHTML documents. It also enables the extraction of RDF model triples by compliant user agents.
In computing, Terse RDF Triple Language (Turtle) is a syntax and file format for expressing data in the Resource Description Framework (RDF) data model. Turtle syntax is similar to that of SPARQL, an RDF query language. It is a common data format for storing RDF data, along with N-Triples, JSON-LD and RDF/XML.
C. Michael Sperberg-McQueen is an American markup language specialist. He was co-editor of the Extensible Markup Language (XML) 1.0 spec (1998), and chair of the XML Schema working group.
The Semantic Web Stack, also known as Semantic Web Cake or Semantic Web Layer Cake, illustrates the architecture of the Semantic Web.
MECS is the Multi-Element Code System, a markup system developed by the Wittgenstein Archives at the University of Bergen. It is very similar to SGML and XML except that it allows elements to overlap.
The Music Encoding Initiative (MEI) is an open-source effort to create a system for representation of musical documents in a machine-readable structure. MEI closely mirrors work done by text scholars in the Text Encoding Initiative (TEI) and while the two encoding initiatives are not formally related, they share many common characteristics and development practices. The term "MEI", like "TEI", describes the governing organization and the markup language. The MEI community solicits input and development directions from specialists in various music research communities, including technologists, librarians, historians, and theorists in a common effort to discuss and define best practices for representing a broad range of musical documents and structures. The results of these discussions are then formalized into the MEI schema, a core set of rules for recording physical and intellectual characteristics of music notation documents. This schema is expressed in an XML schema Language, with RelaxNG being the preferred format. The MEI schema is developed using the One-Document-Does-it-all (ODD) format, a literate programming XML format developed by the Text Encoding Initiative.
XHTML+RDFa is an extended version of the XHTML markup language for supporting RDF through a collection of attributes and processing rules in the form of well-formed XML documents. XHTML+RDFa is one of the techniques used to develop Semantic Web content by embedding rich semantic markup. Version 1.1 of the language is a superset of XHTML 1.1, integrating the attributes according to RDFa Core 1.1. In other words, it is an RDFa support through XHTML Modularization.
Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.
In natural language processing, linguistics, and neighboring fields, Linguistic Linked Open Data (LLOD) describes a method and an interdisciplinary community concerned with creating, sharing, and (re-)using language resources in accordance with Linked Data principles. The Linguistic Linked Open Data Cloud was conceived and is being maintained by the Open Linguistics Working Group (OWLG) of the Open Knowledge Foundation, but has been a point of focal activity for several W3C community groups, research projects, and infrastructure efforts since then.
Drama annotation is the process of annotating the metadata of a drama. Given a drama expressed in some medium, the process of metadata annotation identifies what are the elements that characterize the drama and annotates such elements in some metadata format. For example, in the sentence "Laertes and Polonius warn Ophelia to stay away from Hamlet." from the text Hamlet, the word "Laertes", which refers to a drama element, namely a character, will be annotated as "Char", taken from some set of metadata. This article addresses the drama annotation projects, with the sets of metadata and annotations proposed in the scientific literature, based markup languages and ontologies.