Sweble

Original author(s): OSR Group
Initial release: May 1, 2011 [1]
Stable release: 2.0 / September 14, 2014 [2]
Written in: Java
Operating system: Cross-platform
Type: Parser
License: Apache License
Website: sweble.org

The Sweble Wikitext parser [3] is an open-source tool to parse the Wikitext markup language used by MediaWiki, the software behind Wikipedia. The initial development was done by Hannes Dohrn as a Ph.D. thesis project at the Open Source Research Group of Professor Dirk Riehle at the University of Erlangen-Nuremberg from 2009 until 2011. The results were presented to the public for the first time at the WikiSym conference in 2011. [4] Before that, the dissertation [5] had undergone independent scientific peer review and was published by ACM Press.

Based on the statistics at Ohloh, [6] the parser is written mainly in the Java programming language. It was open-sourced in May 2011. [1] The parser itself is generated from a parsing expression grammar (PEG) using the Rats! parser generator. Encoding validation is done with a lexical analyser generated using JFlex.

A preprint version of the paper on the design of the Sweble Wikitext parser can be found on the project's homepage. [7] In addition, a summary page exists among MediaWiki's future pages. [8]

The current state of parsing

The parser used in MediaWiki converts the content directly from Wikitext into HTML. This process is done in two stages: [9]

  1. Searching for and expanding templates (such as infoboxes), variables, and meta-information (e.g. {{lc:ABC}} is converted into the lower-case abc). Since template pages can themselves contain such meta-information, these have to be evaluated recursively. This approach is similar to macro expansion as used, for example, in programming languages like C++.
  2. Parsing and rendering of the now fully expanded text. In this step, the text is processed by a sequence of built-in MediaWiki functions that each recognise a single construct. These analyse the content using regular expressions and replace, for example, = HEAD = with its HTML equivalent <h1>HEAD</h1>. In most cases these steps are carried out line by line, with tables and lists being the exceptions; a simplified sketch of this approach follows below.
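
The following Java sketch only illustrates this line-by-line, regex-based substitution strategy; MediaWiki's actual parser is written in PHP and is considerably more complex, and the class and method names used here are invented for the example.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Toy illustration of line-by-line, regex-based rendering (not MediaWiki's actual code).
    public class NaiveWikitextRenderer {

        // Matches a level-1 heading such as "= HEAD =" on a single line.
        private static final Pattern HEADING = Pattern.compile("^=\\s*(.+?)\\s*=$");

        static String renderLine(String line) {
            Matcher m = HEADING.matcher(line);
            if (m.matches()) {
                return "<h1>" + m.group(1) + "</h1>";
            }
            return line; // further constructs would be handled by additional regex passes
        }

        public static void main(String[] args) {
            // Each line is transformed in isolation: whether it sits inside a table
            // cell or an image caption is not taken into account, which is exactly
            // the source of the nesting problems described below.
            for (String line : new String[] { "= HEAD =", "plain paragraph text" }) {
                System.out.println(renderLine(line));
            }
        }
    }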

As the authors of Sweble write in their paper, [7] an analysis of the source code of MediaWiki's parser showed that the strategy of using separate transformation steps leads to new problems: most of the functions used do not take the scope of the surrounding elements into account, which leads to incorrect nesting in the resulting HTML output. As a result, the evaluation and rendering of that HTML can be ambiguous and depend on the rendering engine of the web browser used. They state:

"The individual processing steps often lead to unexpected and inconsistent behavior of the parser. For example, lists are recognized inside table cells. However, if the table itself appears inside a framed image, lists are not recognized." [7]

As argued at the WikiSym conference in 2008, a lack of language precision and component decoupling hinders the evolution of wiki software. If wiki content had a well-specified representation that is fully machine-processable, this would not only lead to better accessibility of its content but also improve and extend the ways in which it can be processed. [10]

In addition, a well-defined object model for wiki content would allow further tools to operate on it. There have been numerous attempts at implementing a new parser for MediaWiki, but none of them has succeeded so far. The authors of Sweble state that this might be "due to their choice of grammar, namely the well-known LALR(1) and LL(k) grammars. While these grammars are only a subset of context-free grammars, Wikitext requires global parser state and can therefore be considered a context-sensitive language." [7] As a result, they base their parser on a parsing expression grammar (PEG).
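
To give an idea of what distinguishes a PEG-based parser from the LALR(1) and LL(k) approaches mentioned above, the following toy Java sketch hand-codes a single PEG rule using ordered choice with backtracking. It is only an illustration of the PEG idea, not the Rats!-generated grammar that Sweble actually uses; all names are invented for the example.

    // Hand-written sketch of the PEG rule
    //   Line    <- Heading / Plain
    //   Heading <- "=" Text "="
    // The parser first tries the Heading alternative and, if it fails at any point,
    // resets its position and falls through to the next alternative (ordered choice).
    public class TinyPegDemo {

        private final String input;
        private int pos;

        TinyPegDemo(String input) { this.input = input; }

        // Line <- Heading / Plain
        String parseLine() {
            int mark = pos;                 // remember the position for backtracking
            String heading = parseHeading();
            if (heading != null) {
                return "<h1>" + heading + "</h1>";
            }
            pos = mark;                     // backtrack and try the next alternative
            return input.substring(pos);    // Plain: take the rest of the line as text
        }

        // Heading <- "=" Text "="
        private String parseHeading() {
            if (!match('=')) return null;
            StringBuilder text = new StringBuilder();
            while (pos < input.length() && input.charAt(pos) != '=') {
                text.append(input.charAt(pos++));
            }
            if (!match('=')) return null;
            return text.toString().trim();
        }

        private boolean match(char c) {
            if (pos < input.length() && input.charAt(pos) == c) { pos++; return true; }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(new TinyPegDemo("= HEAD =").parseLine());     // <h1>HEAD</h1>
            System.out.println(new TinyPegDemo("= no closing").parseLine()); // falls back to plain text
        }
    }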

How Sweble works

Sweble parses the Wikitext and produces an abstract syntax tree as output. This helps to avoid errors from incorrect markup code (e.g. having a link spanning over multiple cells of a table). A detailed description of the abstract syntax tree model can be found in a technical report about the Wikitext Object Model (WOM). [11]
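
As a rough illustration of why a tree representation rules out such errors, the hypothetical Java sketch below models a table cell that contains a link. The node classes are invented for the example and do not correspond to Sweble's actual WOM types; the point is only that in a tree a link node is always the child of exactly one cell node and therefore cannot span cell boundaries.

    import java.util.ArrayList;
    import java.util.List;

    // Invented, minimal node model for illustration only (not Sweble's WOM classes).
    abstract class WikiNode {
        final List<WikiNode> children = new ArrayList<>();
    }

    class WikiText extends WikiNode {
        final String content;
        WikiText(String content) { this.content = content; }
    }

    class WikiInternalLink extends WikiNode {
        final String target;
        WikiInternalLink(String target) { this.target = target; }
    }

    class WikiTableCell extends WikiNode { }

    public class AstExample {
        public static void main(String[] args) {
            // A link inside a table cell: the link is a child of exactly one cell,
            // so markup that would "span" two cells simply cannot be expressed.
            WikiTableCell cell = new WikiTableCell();
            WikiInternalLink link = new WikiInternalLink("Wikipedia");
            link.children.add(new WikiText("the free encyclopedia"));
            cell.children.add(link);
            System.out.println("cell has " + cell.children.size() + " child node(s)");
        }
    }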

Steps of parsing

The parser processes Wikitext in five stages: [7]

1. Encoding validation
Since not all possible characters are allowed in Wikitext (e.g. control characters in Unicode), a cleaning step is needed before the actual parsing starts. In addition, some internal naming is performed to facilitate the later steps by making the resulting names for entities unique. In this process it must be ensured that characters used as prefixes by the parser are not escaped or changed. At the same time, this stage should not lead to information loss due to the stripping of characters from the input.
2. Pre-processing
After the text has been cleaned of illegal characters, the resulting Wikitext is prepared for expansion. For this purpose it is scanned for XML-like comments, meta-information such as redirections, conditional tags, and tag extensions. The latter are XML elements that are treated similarly to parser functions and variables. XML elements with unknown names are treated as generic text.
The result of this stage is an AST which consists mostly of text nodes, but also contains redirect links, transclusion nodes, and tag extension nodes.
3. Expansion
Pages in a MediaWiki installation are often built using templates, magic words, parser functions and tag extensions. [9] For use of the AST in a WYSIWYG editor, expansion would be skipped so that the unexpanded transclusion statements and parser function calls of the original page remain visible. For rendering the content, e.g. as an HTML page, however, these must be processed to produce the complete output. Moreover, pages used as templates can themselves transclude other pages, which makes expansion a recursive process.
4. Parsing
Before parsing starts, the AST has to be converted back into Wikitext. Once this step is done, a PEG parser analyzes the text and generates an AST capturing the syntax and semantics of the wiki page.
5. Post-processing
In this stage tags are matched to form whole output elements. Moreover, apostrophes are analysed to decide which of them are real prose apostrophes and which have to be interpreted as markup for bold or italic text in Wikitext. The assembly of paragraphs is also handled in this step. To do so, the AST is processed using a depth-first traversal of the tree structure.
The rendering of the different kinds of output, as well as the analysis functions, is realised as visitors. This helps to separate the AST data structure from the algorithms that operate on the data; a simplified sketch of this pattern follows below.
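
The following minimal Java sketch shows the visitor pattern applied to a toy Wikitext AST. The node and visitor classes are invented for illustration and differ from Sweble's real classes; the sketch merely demonstrates how rendering logic can live in a visitor while the node classes stay free of it. Another visitor implementing the same interface could, for example, emit plain text or collect statistics without any change to the node classes.

    // Invented node and visitor types; Sweble's actual AST and visitor classes differ.
    interface NodeVisitor<R> {
        R visitText(TextNode node);
        R visitBold(BoldNode node);
    }

    abstract class AstNode {
        abstract <R> R accept(NodeVisitor<R> visitor);
    }

    class TextNode extends AstNode {
        final String text;
        TextNode(String text) { this.text = text; }
        <R> R accept(NodeVisitor<R> visitor) { return visitor.visitText(this); }
    }

    class BoldNode extends AstNode {
        final AstNode child;
        BoldNode(AstNode child) { this.child = child; }
        <R> R accept(NodeVisitor<R> visitor) { return visitor.visitBold(this); }
    }

    // One possible renderer: the HTML generation lives here, not in the node classes.
    class HtmlRenderer implements NodeVisitor<String> {
        public String visitText(TextNode node) { return node.text; }
        public String visitBold(BoldNode node) { return "<b>" + node.child.accept(this) + "</b>"; }
    }

    public class VisitorExample {
        public static void main(String[] args) {
            AstNode tree = new BoldNode(new TextNode("Sweble"));
            System.out.println(tree.accept(new HtmlRenderer())); // prints <b>Sweble</b>
        }
    }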

References

  1. 1 2 "announcement of the first public release of Sweble". Archived from the original on 2015-09-16. Retrieved 2011-11-24.
  2. "Sweble 2.0 released!". Archived from the original on 2015-02-27. Retrieved 2015-05-02.
  3. "Homepage of the Sweble project". Archived from the original on 2015-04-30. Retrieved 2011-11-24.
  4. "Announcement at WikiSym conference website". Archived from the original on 2013-07-03. Retrieved 2011-11-24.
  5. Dohrn, Hannes; Riehle, Dirk (2011). "Design and Implementation of the Sweble Wikitext Parser: Unlocking the Structured Data of Wikipedia". Proceedings of the 7th International Symposium on Wikis and Open Collaboration (WikiSym 2011). ACM: 72–81. doi:10.1145/2038558.2038571. S2CID   9911791.
  6. Ohloh page of the Sweble project [ permanent dead link ]
  7. 1 2 3 4 5 "paper on the design of the Sweble Wikitext Parser" (PDF). Archived from the original (PDF) on 2015-02-24. Retrieved 2011-11-24.
  8. Future page for Sweble at MediaWiki
  9. 1 2 Markup Spec - MediaWiki
  10. Junghans, Martin; Riehle, Dirk; Gurram, Rama; Lopes, Mário; Yalcinalp, Umit (2008). "A grammar for standardized wiki markup". Proceedings of the 4th International Symposium on Wikis. pp. 21:1–21:8. doi:10.1145/1822258.1822287. ISBN   9781605581286. S2CID   29443210.{{cite book}}: CS1 maint: date and year (link)
  11. Dohrn, Hannes; Riehle, Dirk (July 2011). "WOM: An Object Model for Wikitext". Technical Report CS-2011-05 (July 2011). University of Erlangen.