Sweble
   HOME

TheInfoList



OR:

The Sweble Wikitext parser is an open-source tool to
parse Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lati ...
the
Wikitext A wiki ( ) is an online hypertext publication Collaborative editing, collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be ...
markup language Markup language refers to a text-encoding system consisting of a set of symbols inserted in a text document to control its structure, formatting, or the relationship between its parts. Markup is often used to control the display of the document ...
used by
MediaWiki MediaWiki is a free and open-source wiki software. It is used on Wikipedia and almost all other Wikimedia websites, including Wiktionary, Wikimedia Commons and Wikidata; these sites define a large part of the requirement set for MediaWiki ...
, the software behind
Wikipedia Wikipedia is a multilingual free online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system. Wikipedia is the largest and most-read refer ...
. The initial development was done by Hannes Dohrn as a
Ph.D. A Doctor of Philosophy (PhD, Ph.D., or DPhil; Latin: or ') is the most common degree at the highest academic level awarded following a course of study. PhDs are awarded for programs across the whole breadth of academic fields. Because it is a ...
thesis A thesis ( : theses), or dissertation (abbreviated diss.), is a document submitted in support of candidature for an academic degree or professional qualification presenting the author's research and findings.International Standard ISO 7144: ...
project at th
Open Source Research Group
of professor
Dirk Riehle A dirk is a long bladed thrusting dagger.Chisholm, Hugh (ed.), ''Dagger'', The Encyclopædia Britannica, 11th ed., Vol. VII, New York, NY: Cambridge University Press (1910), p. 729 Historically, it gained its name from the Highland Dirk (Scot ...
at the
University of Erlangen-Nuremberg A university () is an institution of higher (or tertiary) education and research which awards academic degrees in several academic disciplines. Universities typically offer both undergraduate and postgraduate programs. In the United States, t ...
from 2009 until 2011. The results were presented to the public for the first time at the
WikiSym OpenSym is a shorthand for International Symposium on Open Collaboration, formerly International Symposium on Wikis and Open Collaboration, also formerly WikiSym or the Wiki Symposium, a conference dedicated to wiki research and practice. In 20 ...
conference in 2011. Before that, the dissertation was inspected and approved by an independent scientific peer-review and was published at
ACM Press The Association for Computing Machinery (ACM) is a US-based international learned society for computing. It was founded in 1947 and is the world's largest scientific and educational computing society. The ACM is a non-profit professional member ...
. Based on the statistics at
Ohloh Black Duck Open Hub, formerly Ohloh, is a website which provides a web services suite and online community platform that aims to index the open-source software development community. It was founded by former Microsoft managers Jason Allen and Sc ...
the parser is mainly written in the
Java programming language Java is a high-level, class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose programming language intended to let programmers ''write once, run anywh ...
. It was open-sourced in May 2011. The parser itself is generated from a parsing expression grammar (PEG) using th
Rats! parser generator
The encoding validation is done using a
flex lexical analyser Flex (fast lexical analyzer generator) is a free and open-source software alternative to lex. It is a computer program that generates lexical analyzers (also known as "scanners" or "lexers"). It is frequently used as the lex implementation togeth ...
written i
JFlex
A preprint version of the paper on the design of the Sweble Wikitext Parser can be found at the projects homepage. In addition to that, a summary page exists at the
MediaWiki MediaWiki is a free and open-source wiki software. It is used on Wikipedia and almost all other Wikimedia websites, including Wiktionary, Wikimedia Commons and Wikidata; these sites define a large part of the requirement set for MediaWiki ...
's futures.


The current state of parsing

The parser used in
MediaWiki MediaWiki is a free and open-source wiki software. It is used on Wikipedia and almost all other Wikimedia websites, including Wiktionary, Wikimedia Commons and Wikidata; these sites define a large part of the requirement set for MediaWiki ...
converts the content directly from
Wikitext A wiki ( ) is an online hypertext publication Collaborative editing, collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be ...
into
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScri ...
. This process is done in two stages:Markup Spec - MediaWiki
/ref> # Searching and expansion of
templates Template may refer to: Tools * Die (manufacturing), used to cut or shape material * Mold, in a molding process * Stencil, a pattern or overlay used in graphic arts (drawing, painting, etc.) and sewing to replicate letters, shapes or designs Co ...
(like infoboxes), variables, and meta-information (e.g. '''' gets converted into lower-case ''abc''). Template pages can again have such meta-information so that those have to be evaluated as well (
Recursion Recursion (adjective: ''recursive'') occurs when a thing is defined in terms of itself or of its type. Recursion is used in a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in mathematics ...
). This approach is similar to
macro expansion In computer programming, a macro (short for "macro instruction"; ) is a rule or pattern that specifies how a certain input should be Map (mathematics), mapped to a replacement output. Applying a macro to an input is known as macro expansion. Th ...
used e.g. in
programming language A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, but they may also be graphical. They are a kind of computer language. The description of a programming ...
s like
C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
. # Parsing and rendering of the now fully expanded text. Hereby, the text is processed by a sequence of built-in functions of
MediaWiki MediaWiki is a free and open-source wiki software. It is used on Wikipedia and almost all other Wikimedia websites, including Wiktionary, Wikimedia Commons and Wikidata; these sites define a large part of the requirement set for MediaWiki ...
that each recognise a single construct. Those analyse the content using
Regular Expressions A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...
and replace e.g. ''= HEAD ='' with its
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScri ...
equivalent ''<h1>HEAD</h1>''. In most of the cases, these steps are done line by line, with exceptions being tables or lists. As the authors of Sweble write in their paper, an analysis of the
source code In computing, source code, or simply code, is any collection of code, with or without comments, written using a human-readable programming language, usually as plain text. The source code of a program is specially designed to facilitate the wo ...
of
MediaWiki MediaWiki is a free and open-source wiki software. It is used on Wikipedia and almost all other Wikimedia websites, including Wiktionary, Wikimedia Commons and Wikidata; these sites define a large part of the requirement set for MediaWiki ...
's parser showed that the strategy of using separate transformation steps leads to new problems: Most of the used functions do not take the
scope Scope or scopes may refer to: People with the surname * Jamie Scope (born 1986), English footballer * John T. Scopes (1900–1970), central figure in the Scopes Trial regarding the teaching of evolution Arts, media, and entertainment * Cinem ...
of the surrounding elements into account. This consequently leads to wrong nesting in the resulting
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScri ...
output. As a result, the evaluation and rendering of the latter can be ambiguous and depends on the rendering engine of the used
web browser A web browser is application software for accessing websites. When a user requests a web page from a particular website, the browser retrieves its files from a web server and then displays the page on the user's screen. Browsers are used on ...
. They state: :"The individual processing steps often lead to unexpected and inconsistent behavior of the parser. For example, lists are recognized inside table cells. However, if the table itself appears inside a framed image, lists are not recognized." As argued on the
WikiSym OpenSym is a shorthand for International Symposium on Open Collaboration, formerly International Symposium on Wikis and Open Collaboration, also formerly WikiSym or the Wiki Symposium, a conference dedicated to wiki research and practice. In 20 ...
conference in 2008, a lack of language precision and component decoupling hinders evolution of
wiki A wiki ( ) is an online hypertext publication collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be either open to the pu ...
software Software is a set of computer programs and associated documentation and data. This is in contrast to hardware, from which the system is built and which actually performs the work. At the lowest programming level, executable code consists ...
. If the wiki content had a well-specified representation that is fully machine processable, this would not only lead to better accessibility of its content but also improve and extend the ways in which it can be processed. In addition, a well-defined
object model In computing, object model has two related but distinct meanings: # The properties of objects in general in a specific computer programming language, technology, notation or methodology that uses them. Examples are the object models of ''Java'', ...
for wiki content would allow further tools to operate on it. Until now there have been numerous attempts at implementing a new parser for
MediaWiki MediaWiki is a free and open-source wiki software. It is used on Wikipedia and almost all other Wikimedia websites, including Wiktionary, Wikimedia Commons and Wikidata; these sites define a large part of the requirement set for MediaWiki ...
(se

. None of them has succeeded so far. The authors of Sweble state that this might be "due to their choice of Formal grammar, grammar, namely the well-known
LALR In computer science, an LALR parser or Look-Ahead LR parser is a simplified version of a canonical LR parser, to parse a text according to a set of production rules specified by a formal grammar for a computer language. ("LR" means left-to-right, ...
(1) and LL(k) grammars. While these grammars are only a subset of context-free grammars,
Wikitext A wiki ( ) is an online hypertext publication Collaborative editing, collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be ...
requires global parser state and can therefore be considered a
context-sensitive language In formal language theory, a context-sensitive language is a language that can be defined by a context-sensitive grammar (and equivalently by a noncontracting grammar). Context-sensitive is one of the four types of grammars in the Chomsky hierarc ...
." As a result, they base their parser on a parsing expression grammar (PEG).


How Sweble works

Sweble parses the
Wikitext A wiki ( ) is an online hypertext publication Collaborative editing, collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be ...
and produces an
abstract syntax tree In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language. Each node of the tree denotes a construct occurring ...
as output. This helps to avoid errors from incorrect markup code (e.g. having a link spanning over multiple cells of a table). A detailed description of the
abstract syntax tree In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language. Each node of the tree denotes a construct occurring ...
model A model is an informative representation of an object, person or system. The term originally denoted the Plan_(drawing), plans of a building in late 16th-century English, and derived via French and Italian ultimately from Latin ''modulus'', a mea ...
can be found in a
technical report A technical report (also scientific report) is a document that describes the process, progress, or results of technical or scientific research or the state of a technical or scientific research problem. It might also include recommendations and co ...
about the Wikitext Object Model (WOM).


Steps of parsing

The parser processes
Wikitext A wiki ( ) is an online hypertext publication Collaborative editing, collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be ...
in five stages: ; 1. Encoding validation : Since not all possible characters are allowed in
Wikitext A wiki ( ) is an online hypertext publication Collaborative editing, collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be ...
(e.g. control characters in
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology Technical standard, standard for the consistent character encoding, encoding, representation, and handling of Character (computing), text expre ...
), a cleaning step is needed before starting the actual parsing. In addition, some internal naming is performed to facilitate the later steps by making the resulting names for entities unique. In this process it must be ensured that character used as prefix for the parser are not escaped or changed. However, this stage should not lead to information loss due to stripping of characters from the input. ; 2. Pre-processing : After cleaning the text from illegal characters, the resulting
Wikitext A wiki ( ) is an online hypertext publication Collaborative editing, collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be ...
is prepared for expansion. For this purpose it is scanned for
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...
-like comments, meta-information such as redirections et cetera, conditional tags, and tag extensions. The latter are
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...
elements that are treated similar to parser functions or variables, respectively.
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...
elements with unknown names are treated as if they are generic text.
The result of this stage is an AST which consists mostly of text nodes but also redirect links, transclusion nodes and the nodes of tag extensions. ; 3. Expansion : Pages in a MediaWiki are often built using templates, magic words, parser functions and tag extensions. To use the AST in a WYSIWYG editor, one would leave out expansion to see the unexpanded
transclusion In computer science, transclusion is the inclusion of part or all of an electronic document into one or more other documents by reference via hypertext. Transclusion is usually performed when the referencing document is displayed, and is normal ...
statements and parser function calls in the original page. However, for rendering the content e.g. as
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScri ...
page these must be processed to get the complete output. Moreover, pages used as templates can themselves transclude other pages which makes expansion a recursive process. ; 4. Parsing : Before
parsing Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lati ...
starts, the AST has to be converted back into
Wikitext A wiki ( ) is an online hypertext publication Collaborative editing, collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be ...
. Once this step is done, a PEG parser analyzes the text and generates an AST capturing the
syntax In linguistics, syntax () is the study of how words and morphemes combine to form larger units such as phrases and sentences. Central concerns of syntax include word order, grammatical relations, hierarchical sentence structure ( constituency) ...
and
semantics Semantics (from grc, σημαντικός ''sēmantikós'', "significant") is the study of reference, meaning, or truth. The term can be used to refer to subfields of several distinct disciplines, including philosophy Philosophy (f ...
of the
wiki A wiki ( ) is an online hypertext publication collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be either open to the pu ...
page. ; 5. Post-processing : In this stage tags are matched to form whole output elements. Moreover,
apostrophe The apostrophe ( or ) is a punctuation mark, and sometimes a diacritical mark, in languages that use the Latin alphabet and some other alphabets. In English, the apostrophe is used for two basic purposes: * The marking of the omission of one o ...
s are analyzed to decide which of them are real
prose Prose is a form of written or spoken language that follows the natural flow of speech, uses a language's ordinary grammatical structures, or follows the conventions of formal academic writing. It differs from most traditional poetry, where the f ...
apostrophes and which have to be interpreted as markup for
bold In typography, emphasis is the strengthening of words in a text with a font in a different style from the rest of the text, to highlight them. It is the equivalent of prosody stress in speech. Methods and use The most common methods in W ...
or italic font in
Wikitext A wiki ( ) is an online hypertext publication Collaborative editing, collaboratively edited and managed by its own audience, using a web browser. A typical wiki contains multiple pages for the subjects or scope of the project, and could be ...
. The assembly of paragraphs is also handled in this step. Hereby, the AST is processed using a
depth-first traversal Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures. The algorithm starts at the root node (selecting some arbitrary node as the root node in the case of a graph) and explores as far as possible alon ...
on the
tree structure A tree structure, tree diagram, or tree model is a way of representing the hierarchical nature of a structure in a graphical form. It is named a "tree structure" because the classic representation resembles a tree, although the chart is gener ...
.
The rendering of the different kinds of output as well as the analyzing functions are realized as
Visitors Visitor, in English and Welsh law, is an academic or ecclesiastical title. Visitor or Visitors may also refer to: Geography * Visitor (mountain), a mountain in eastern Montenegro * Lake Visitor, a mountain lake in eastern Montenegro Literature * ...
. This helps to separate the AST data structure from the
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...
s that operate on the
data In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted ...
.


References

{{reflist, 30em Parsing MediaWiki Compiling tools Unix programming tools