OmniMark

OmniMark is a fourth-generation programming language used mostly in the publishing industry. It is currently a proprietary software product of Stilo International. As of July 2022, the most recent release [1] of OmniMark was 11.0.

Usage

OmniMark is used to process data and convert it from one format to another, using a streaming architecture [2] that allows it to handle large volumes of content sequentially without having to keep it all in memory. It has a built-in XML parser and supports XQuery through integration with the Sedna native XML database. It also has find rules, which implement a concept similar to regular expressions, although the pattern syntax is more English-like than the regular-expression syntax used in Perl, Ruby, and other more widely used languages. OmniMark can also be used for schema transformation tasks in the same way as XSLT, but it supports switching between procedural and functional code without any additional constructs to support the procedural elements.
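As a minimal sketch of the pattern style, the rule below reuses the price pattern that appears in the examples later in this article; the output text is invented for this sketch, and the Perl regular expression in the comment is given only as a rough point of comparison:

find "$" digit+ "." digit{2}
   ; a roughly equivalent Perl regular expression would be /\$\d+\.\d{2}/
   output "price found%n"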

History

OmniMark was originally created in the 1980s by Exoterica, a Canadian software company, as an SGML processing program called XTRAN. [3] XTRAN was later renamed OmniMark, and Exoterica became OmniMark Technologies. The current owner of OmniMark, Stilo International, has its main offices in the UK but also maintains an office in Canada. [4]

In 1999, OmniMark president and CEO John McFadden announced that OmniMark 5 would be available free of charge, to better compete with Perl. [5] OmniMark is no longer distributed under such a model.

Programming model

OmniMark treats input as a flow that can be scanned once, rather than as a static collection of data that supports random access. Much of an OmniMark program is in the form of condition => action rules, where the condition recognizes a length of data to be acted upon and the action specifies what is to be done with the data. There are two kinds of condition: patterns that match lengths of unstructured input, and markup structures recognized by the parser in structured input.

Processing unstructured input

Find rules are used to apply patterns to unstructured input. Lengths of text are recognized by a pattern that includes temporary pattern variables to capture any part of the text that will be needed in the output. The action uses those variables to produce the required output:

; Change prices from English format to French format
find "$" digit+ => dollars "." digit{2} => cents
   ; output the price in new format
   output dollars || "," || cents || "$"

If two find rules can recognize the same sequence of text, the first rule will “eat” the sequence and the second rule will never see the text. Input that is not recognized by any find rule does not get “eaten” and passes right through to the output.
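As a minimal sketch of this ordering behaviour (the rules and sample input are hypothetical, not taken from the article's examples): given the input "catalog", the first rule below eats "cat", the second rule never fires, and the remaining "alog" passes through to the output unchanged.

find "cat"
   output "[cat]"       ; matches first and eats the text
find "catalog"
   output "[catalog]"   ; never sees text already eaten by the rule above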

Processing structured input (XML, SGML)

OmniMark sees input as a flow; a program does not hold input data in memory unless part of the data has been saved in variables. As the input flows by, OmniMark maintains an element stack containing information that can be used to guide the transformation of text via the OmniMark pattern-matching facility. When each start tag is encountered, OmniMark pushes another element description onto the stack. The element description includes the element name, the attribute names with the types and values of the attributes, along with other information from the parser (such as whether that element is an EMPTY element). When the corresponding end tag is encountered, the element description is popped from the top of the stack. With SGML, some tags may be omitted, but OmniMark acts as if the tags were present and in the right places.

OmniMark element stack

[Diagram: nested elements <example>, <body>, and <h1>, with scan positions A to F marked at their start and end tags and X marking the current position in the input document.]

    Scan        Available information
    A to F      element example
    B to E      elements example, body
    C to D      elements example, body, h1

    Location
    C           beginning of content
    D           end of content

An OmniMark program uses element rules to process XML or SGML documents. An element rule:

  • gets control just after the start tag has been parsed, and the element description has been pushed on the element stack. The action for the element rule has access to the descriptions of the current element and all the ancestor elements back to the document root.
  • passes control back to the parser by requesting the parsed content of the element via the special value "%c". The content is usually requested for scanning with pattern matching, rather than for storage in a variable.
  • gets control again when the corresponding end tag has been parsed, but before the element description is popped from the element stack. The action for the element rule still has access to the descriptions of the current element and all the ancestor elements back to the document root.

Since elements can be nested, several element rules can be in play at the same time, each with a corresponding element description on the element stack. Element rules are suspended while waiting for the parser to finish parsing their content. Only the rule for the element at the top of the stack can be active. When end of content is reached for the element at the top of the stack, the action for the corresponding element rule gets control again. When that action exits, the element description is popped and control is returned to the action for the next lower element on the stack. An element rule might simply output the parsed content (as text) and append a suffix:

element"code"output"%c"; parse and output element contentdowhenparentisnt("h1"|"h2"|"h3"|"h4"|"h5"|"h6")output"%n"; append a newline if not in a headingdone

A program does not need to name all of the document elements if the unnamed elements can be given some kind of generic processing:

element #implied
   do when parent is "head"
      suppress     ; discard child elements
   else
      output "%c"  ; parse and output element content
   done

Pattern matching on output from the parser

The parsed content of each element is made available within an element rule and can be modified by a repeat ... scan block that uses patterns to identify the text to be modified:

element"p"    ; Change prices from English format to French formatrepeatscan"%c"    ; parse and scan element contentmatch"$"digit+=>dollars"."digit{2}=>cents            ; output the price in new formatoutputdollars||","||cents||"$"match(anyexcept"$")+=>text            ; output non-price sequences without changeoutputtextmatch"$"=>text            ; output isolated currency symbol without changeoutputtextdone

The first pattern that matches a leading part of the text will “eat” that text, and the text will not be available to the following patterns even if one of the following patterns could match a longer leading part of the text. Any leading part that does not match one of the patterns in a repeat ... scan block will be discarded.
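As a sketch of this discarding behaviour, assuming the same "p" element as above but with the two catch-all match alternatives removed, any text in the element that is not a price is silently dropped rather than copied to the output:

element "p"
   repeat scan "%c"    ; parse and scan element content
   match "$" digit+ => dollars "." digit{2} => cents
      output dollars || "," || cents || "$"
   ; with no other match alternatives, non-price text is discarded
   done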

Pattern matching on input to the parser

Translate rules get control just after tags have been separated from text but before the completion of parsing. Each translate rule has a pattern that identifies a length of text to be processed. That length of text will not include any tags, but could be as much as the full length of text between two tags.

One use of translate rules is to make a specific change throughout an entire document:

; Change markup character to entity that represented it in the input
translate "&"
   output "&amp;"

The tags before the current point in the input have already gone through the parser, so the element stack already has a description of the element (or nested elements) that contain the text. Consequently, the information on the element stack can be used to control what is done with the text. For example, the operation of a translate rule can be limited to the character content of one or more elements:

; Change prices from English format to French format
translate "$" digit+ => dollars "." digit{2} => cents
   when element is ("p"|"code")
   ; output the price in new format
   output dollars || "," || cents || "$"

Example code

In some applications, much of a document can be handled by a well-designed generic action, so that only a fraction of the document needs special handling. This can greatly reduce the size and complexity of a program and, in the case of XML documents, can make a program very tolerant of changes in the structure of the input document.

A simple program

This is the basic "Hello, World!" program:

process
   output "Hello World!"

Unstructured input (text)

This program outputs all words that begin with a capital letter, one word per line, and discards all other text:

process
   submit file "myfile.txt"
   ; or submit "ANY Text discard lowercase words"

; output capitalized word, append a newline
find (uc letter*) => temp
   output temp || "%n"

; discard all other characters
find any
   ; no output

Structured input (XML)

OmniMark can accept well-formed XML, valid XML or SGML as structured input. This program outputs a list of first- and second-level headings from an xhtml file, indenting the second-level headings:

; xhtml-headings.xom
; List first- and second-level headings from xhtml or xhtml5 file
; Second-level headings are indented

process
   ; transform the input document
   ; do xml-parse document   ; parse valid XML
   do xml-parse              ; parse well-formed XML
      scan file "example.html"
      output "%c"            ; parse and output document content
   done

element "head"
   suppress       ; discard child elements

element "h1"
   output "%c"    ; parse and output element content
   output "%n"    ; add a line-end

element "h2"
   output "  "    ; indent 2 spaces
   output "%c"    ; parse and output element content
   output "%n"    ; add a line-end

; handle any element not named in explicit rules above
element #implied
   do when parent is "body"
      ; discard all child elements except those named above
      suppress     ; discard child elements
   else
      ; keep the content of any other element
      output "%c"  ; parse and output element content
   done

; discard character content from element "body" if that element
; has mixed content
translate any+ => X when element is body
   ; no output (do nothing with variable "X")

The element #implied rule picks up any element that is not recognized by one of the other element rules.

Structured input (SGML)

This program replaces the omitted tags in a simple SGML document and outputs something similar to well-formed XML. The program does not translate SGML empty tags correctly into XML empty tags, and it does not handle many other features that can be used in SGML documents.

Program

; Insert omitted tags in SGML document
;
; This program is simplified, for demonstration only.
; The program does not handle many features of SGML.
; A more elaborate program would be required to produce
; well-formed XML from most SGML documents.

process
   do sgml-parse document
      scan file "example.sgml"
      output "%c"    ; parse and output document content
   done

element #implied
   output "<%q"      ; begin start tag
   ; write attributes as name="value" pairs
   repeat over specified attributes as attr
      output " " || key of attribute attr || "=%"%v(attr)%""
   again
   output ">"        ; terminate start tag
   ; write element content
   output "%c"
   ; write end tag if element allows content
   output "</%q>" unless content is (empty|conref)

; translate markup characters (in text) back to the entities
; that represented them in the original input
translate "&"
   output "&amp;"
translate "<"
   output "&lt;"
translate ">"
   output "&gt;"

Example input

<!-- A simple SGML document for input to OmniMark demos -->
<!DOCTYPE example [
  <!ELEMENT example   O - (head, body)>
  <!ELEMENT head      O O (title?)>
  <!ELEMENT title     - - (#PCDATA)>
  <!ELEMENT body      - O ((empty|p)*)>
  <!ELEMENT empty     - O EMPTY>
  <!ELEMENT p         - O (#PCDATA)>
  <!ATTLIST P         id    ID    #IMPLIED>
  <!ENTITY  amp     "&">
  <!ENTITY  lt      "<">
  <!ENTITY  gt      ">">
]>
<example>
<title>Title</title>
<body>
<p>Text
<empty>
<p id="P-2">&lt;&amp;&gt;
</example>

Example output

<EXAMPLE><HEAD><TITLE>Title</TITLE></HEAD><BODY><P>Text</P><EMPTY><P ID="P-2">&lt;&amp;&gt;</P></BODY></EXAMPLE>

References

  1. "Welcome to the OmniMark 11.0 documentation". OmniMark Developer Resources. Retrieved 26 July 2022.
  2. Stilo International (2004). Beginner's Guide to OmniMark (PDF). p. 3. Retrieved 24 September 2018.
  3. Travis, Brian L. (1997). OmniMark at work: Getting Started. Englewood, CO: SGML University Press. p. vii.
  4. "Office Locations". Stilo. Retrieved 24 September 2018.
  5. "OmniMark 5 is Free". Cover Pages. Retrieved 24 September 2018.