Original author(s) | Bruce R Miller |
---|---|
Initial release | 10 May 2004 |
Stable release | 0.8.8 / 29 February 2024 |
Repository | |
Written in | Perl |
Operating system | Unix-like, macOS, Windows |
Type | Document converter |
License | Public domain |
Website | dlmf |
LaTeXML is a free, public domain software package that converts LaTeX documents to XML, HTML, EPUB, JATS and TEI. [1] [2] [3]
LaTeXML's primary output format is an XML representation of (La)TeX's document model. A postprocessor can convert these XML documents into other structured formats. Common use cases produce HTML with mathematical formulas as images, or XHTML, HTML5, and EPUB with formulas as MathML. Compared to other LaTeX-to-XML processors, LaTeXML aims to preserve the semantic structures of the LaTeX markup. This makes it a good basis for semantic services such as math search.
Conversion times range from 30 milliseconds for a single formula (in the LaTeXML daemon) to minutes for book-size documents.
LaTeXML was started in the context of the Digital Library of Mathematical Functions at NIST, where LaTeX documents needed to be prepared for publication on the Web. The system has been under active development for over a decade and has attracted a small but dedicated community of developers and users centered on Bruce Miller, the original project author.
The current released version is LaTeXML 0.8.8. It was released in February 2024, and development remains active on the public repository.
LaTeXML was used to convert 90% (60% without errors) of 530,000 documents from the arXiv to XML. [4] As a result of this ongoing effort to enhance coverage, LaTeXML supports a large range of LaTeX packages. The ACL 2014 conference used LaTeXML to convert submitted papers to XML. [5] This followed existing work on converting the ACL Anthology papers to high-quality semantic markup for further analysis. [6] Since February 2013, LaTeXML has been used to render the web pages of the peer-produced mathematics website PlanetMath. In July 2015, it was adopted by Authorea for their advanced LaTeX support. [7] In 2018, the second data release [8] of the European Space Agency's Gaia project was realized via LaTeXML.
In February 2022, arXiv announced an experimental service based on LaTeXML, offering 1.78 million documents as HTML5. [9] A LaTeXML developer claimed successful conversion of 74% of arXiv, with 97% of articles "at least partially viewable". As of the start of 2024, that experiment has been promoted to arXiv's main article pages. [10] [11]
The core of LaTeXML is a Perl reimplementation of TeX's parsing and digestion algorithm, coupled with a customizable XML emitter. To preserve the semantic structures in the LaTeX markup, LaTeXML needs XML bindings for all LaTeX packages with high-level macro definitions. The LaTeXML distribution currently provides XML bindings for over 200 commonly used LaTeX packages such as AMSTeX, Babel [12] and PGF/TikZ (the last with only experimental support).
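The LaTeXML conversion consists of two stages: first, the LaTeX source is converted to LaTeXML's XML representation; second, a postprocessor converts that XML into a target format such as HTML, XHTML, or EPUB.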
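As a minimal sketch of how the two stages might be driven from a script, the Python snippet below builds one command line per stage: `latexml` for the TeX-to-XML step and `latexmlpost` for the postprocessing step. The helper names `build_commands` and `convert` are hypothetical, and the `--destination` and `--format` options are used as described in the LaTeXML manual; check them against the installed version before relying on them.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed mapping from postprocessor format name to output file extension.
EXTENSIONS = {"html5": ".html", "html": ".html", "xhtml": ".xhtml", "epub": ".epub"}


def build_commands(tex_file, output_format="html5"):
    """Return the command lines for the two conversion stages (hypothetical helper)."""
    if output_format not in EXTENSIONS:
        raise ValueError(f"unsupported format: {output_format}")
    tex_path = Path(tex_file)
    xml_path = tex_path.with_suffix(".xml")
    out_path = tex_path.with_suffix(EXTENSIONS[output_format])
    # Stage 1: TeX source -> LaTeXML's XML representation.
    stage1 = ["latexml", f"--destination={xml_path}", str(tex_path)]
    # Stage 2: XML -> target format via the postprocessor.
    stage2 = [
        "latexmlpost",
        f"--destination={out_path}",
        f"--format={output_format}",
        str(xml_path),
    ]
    return stage1, stage2


def convert(tex_file, output_format="html5"):
    """Run both stages in order; requires a local LaTeXML installation."""
    for tool in ("latexml", "latexmlpost"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    for command in build_commands(tex_file, output_format):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # even where LaTeXML is not installed.
    for command in build_commands("paper.tex"):
        print(" ".join(command))
```

Keeping command construction separate from execution makes the intermediate XML file explicit, which is the point of the two-stage design: the same XML can be postprocessed again into a different target format without re-running the TeX stage.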
LaTeXML 0.8 added daemon functionality which enabled multiple conversions and easy embedding into web services.
LaTeXML 0.8.7 was the first version emitting the "MathML Core" markup language for mathematical syntax, new in MathML 4.