The Parser That Cracked The MediaWiki Code

I am happy to announce that we have finally open-sourced the Sweble Wikitext parser. You can find the announcement on my research group’s blog or directly on the Sweble project site. This is the work of Hannes Dohrn, my first Ph.D. student, whom I hired in 2009 to implement a Wikitext parser.

So what about this “cracking the MediaWiki code”?

Wikipedia aims to bring the (encyclopedic) knowledge of the world to all of us, for free. Though already ten years old, the Wikipedia community is just getting started; we have barely seen the tip of the iceberg, and there is so much more to come. All that wonderful content is written by volunteers using a (seemingly) simple language called Wikitext (the stuff you type once you click on edit). Until today, Wikitext had been poorly defined.

There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well-defined document object model. That is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of PHP code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there have been 30+ failed attempts at writing alternative parsers. That’s why a respected long-time community member asked in exasperation: Can anyone really edit Wikipedia?

The common answer given is to hide the complexity of Wikitext behind a visual editor, but that is not an answer. It doesn’t work: a visual editor, like any other program that wants to work with Wikipedia content, needs a well-understood specification of the language that content is written in. This is where the Sweble parser comes in. Following well-understood computer science best practices, it starts from a well-defined grammar from which a parser generator creates the parser. It uses well-understood object-oriented design patterns (the Visitor pattern, prominently) to build a processing pipeline that transforms source Wikitext into whatever output format is desired. Most importantly, it defines an abstract syntax tree (AST), with a document object model (DOM) tree to follow soon, and works off that tree. We have come a long way from 5000 lines of PHP code.
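
To make the pipeline idea concrete, here is a minimal sketch in Java (the language Sweble is written in) of a visitor walking a tiny Wikitext syntax tree and rendering it as HTML. The node and visitor classes are hypothetical and heavily simplified for illustration; they are not the actual Sweble API, only the general technique of keeping the parsed tree separate from the code that consumes it.

// Hypothetical, heavily simplified AST -- for illustration only, not the Sweble API.
import java.util.List;

interface WikitextNode {
    <T> T accept(NodeVisitor<T> visitor);
}

record Text(String value) implements WikitextNode {
    public <T> T accept(NodeVisitor<T> visitor) { return visitor.visitText(this); }
}

record Bold(List<WikitextNode> children) implements WikitextNode {
    public <T> T accept(NodeVisitor<T> visitor) { return visitor.visitBold(this); }
}

record InternalLink(String target) implements WikitextNode {
    public <T> T accept(NodeVisitor<T> visitor) { return visitor.visitLink(this); }
}

// Every output format is just another visitor; the tree itself never changes.
interface NodeVisitor<T> {
    T visitText(Text node);
    T visitBold(Bold node);
    T visitLink(InternalLink node);
}

// One possible consumer: render simple HTML from the tree.
class HtmlRenderer implements NodeVisitor<String> {
    public String visitText(Text node) { return node.value(); }

    public String visitBold(Bold node) {
        StringBuilder html = new StringBuilder("<b>");
        for (WikitextNode child : node.children()) {
            html.append(child.accept(this));
        }
        return html.append("</b>").toString();
    }

    public String visitLink(InternalLink node) {
        return "<a href=\"/wiki/" + node.target() + "\">" + node.target() + "</a>";
    }
}

public class PipelineSketch {
    public static void main(String[] args) {
        // Pretend this tree came out of a grammar-generated parser for: '''[[Wikipedia]]'''
        WikitextNode tree = new Bold(List.of(new InternalLink("Wikipedia")));
        System.out.println(tree.accept(new HtmlRenderer()));
        // Prints: <b><a href="/wiki/Wikipedia">Wikipedia</a></b>
    }
}

A second visitor, say a plain-text extractor or a link collector, could walk the very same tree without touching the parser at all; that decoupling is exactly what the AST buys.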

So what does creating an AST and DOM tree for Wikitext buy us?

In short, it buys us interoperability and evolvability. In a 2007 paper, using the then-hopeful wiki markup community standard WikiCreole, we explained the need for such interoperability and evolvability as provided by an open standard. Different tools can gather around that format and evolve independently. Today, everything has to move in lock-step with MediaWiki. By untying the content and data from MediaWiki, we are enabling an ecosystem of tools and technologies around Wikipedia (and related projects’) content so these projects can gain more speed and breadth.

Comments

  1. […] formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehle stated: There was no grammar, no defined processing rules, and no defined output like a DOM tree […]

  2. raja

    ponga da i want to learn more wiki markup code pls help me

  3. Dirk Riehle

    Thanks, @Thomas. We met Karl Mathias at the Wikimedia summit, so I believe the Kiwi goal was to render proper HTML? Our goal was nothing like that, though it follows as an option. We really want to get to the content, independent of a particular purpose.

  4. Thomas Luce

    This is really cool, Dirk. I remember hearing something about this before, and it is a great direction to go! We built kiwi (https://github.com/aboutus/kiwi) for slightly different reasons but in a similar vein. We took a different approach, but I love the idea of shifting how you think of the problem as the key to solving it.
    Congrats again!

  5. Dirk Riehle

    @James PS: There is no correct parsing, as your quotation marks imply. The current MediaWiki parser happily creates broken HTML (e.g. an HREF link that crosses table cells) and leaves it to the browser as the final arbiter.

  6. Dirk Riehle

    Hey James, thanks! We didn’t check, but there clearly remains legwork to be done. It wouldn’t be an open source project if we weren’t looking for volunteers to help 🙂 However, we have made it past the summit and are running downhill.
    Frankly, the Wikitext people write is really fun(ny). Take a look at the Saxby Chambliss page on WP! There is a table split over multiple templates 🙂

  7. James

    Congratulations Hannes and Dirk!
    I guess the question is: what percentage of existing Wikipedia pages does this parse “correctly”?

  8. Dirk Riehle

    @Jeroen Thanks! We have been in touch with the WMF since I first told Erik in 2009 that my group would implement a parser. Since it is written in Java, we were not assuming that it would become a direct part of MediaWiki. We demoed our parser at the WMF research summit in February this year – back then they had not yet started on a new parser. My hope is that there will be multiple parsers for different purposes, but most importantly, that these parsers will be for a new, cleaner, and simpler Wikitext based on an open standard. Sticking to the current Wikitext does not make sense to me.

  9. Jeroen De Dauw

    Coolness 🙂
    Is the idea to try to get this into MediaWiki core? I think this will be an extremely hard sell when it’s written in Java, as it simply will not work for many thousands of existing MediaWiki installs. In any case, are you in contact with the WMF about this? As I understand it, they have plans themselves to have the parser rewritten, which I figure is unrelated to this project. Maybe some cooperation/coordination is in order?

  10. […] The Parser that Cracked the MediaWiki Code | Software Research and the Industry. As many of us have been commenting for many years, the optimum freedom comes from neither open standards nor from open source alone but from the combination of the two. The wiki markup language used by MediaWiki is expressed only as open source code; this notable and valuable effort seeks to codify it in a way that makes it possible to repurpose wiki content in the future programmatically. This kind of activity is enormously important culturally. (tags: Standards OpenSource Wikipedia MediaWiki Wiki Parser) […]

  11. Dirk Riehle

    @andreas Thanks!
    @bawolff Thanks for filing the bug report! It is definitely slower (for now 🙂). Our parser is written in Java, not in PHP. Our goal is to help achieve evolvability of Wikitext again. In the extreme case, our parser has exactly one use: to transform Wikipedia content from the current version of Wikitext into a new version of Wikitext that can be processed more easily (DOM tree and all). For that, we need to get close enough to the output of the current MediaWiki parser (which can’t do this transformation).

  12. bawolff

    It’d be interesting to hear if there are any performance comparisons between this and the MediaWiki parser. Is it faster or slower, and if so, significantly so?
    Also, is it intended to have the exact same output as the PHP parser? I noticed some minor differences (I filed a bug). It’d be interesting to know whether it passes MediaWiki’s parser tests, or whether it is even intended to.

  13. Andreas Kuckartz

  14. Dirk Riehle

    @Chris – thanks. I think the intent is different: our work is not (only) about lightweight formatting, but about the whole complexity of MediaWiki’s Wikitext. While I didn’t look at the project in detail, I’d be surprised if it could handle the complexity of Wikipedia pages (templates and tables, that is, multi-level includes, etc.).

  15. Chris Aniszczyk

    Have you looked at some of the work over at the Eclipse Mylyn Wikitext project?
    http://wiki.eclipse.org/Mylyn/Incubator/WikiText
    It’s a pretty mature project that works with a variety of wiki markup languages out there, along with a bunch of converters. It’s all available under the EPL 1.0.

  16. Dirk Riehle

    Great! Let us know how it goes and if we can be of help over on http://sweble.org

  17. James Michael DuPont

    I am very happy to hear about this; I will be trying it out.
    mike
