The Parser that Cracked the MediaWiki Code

I am happy to announce that we have finally open-sourced the Sweble Wikitext parser. You can find the announcement on the OSR Group blog or directly on the Sweble project site. This is the work of Hannes Dohrn, my first Ph.D. student, whom I hired in 2009 to implement a Wikitext parser.

So what about this “cracking the MediaWiki code”?

Wikipedia aims to bring the (encyclopedic) knowledge of the world to all of us, for free. While already ten years old, the Wikipedia community is just getting started, and we have barely seen the tip of the iceberg; there is so much more to come. All that wonderful content is being written by volunteers using a (seemingly) simple language called Wikitext (the stuff you type in once you click on edit). Until today, Wikitext had been poorly defined.

There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well-defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of PHP code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there are 30+ failed attempts at writing alternative parsers. That’s why a respected long-time community member asked in exasperation: Can anyone really edit Wikipedia?

The common answer given is to hide the complexity of Wikitext behind a visual editor, but that is not an answer. It doesn’t work: A visual editor, like any other algorithm that wants to work with Wikipedia content, needs a well-understood specification of the language that content is written in. This is where the Sweble parser comes in. Following computer science best practices, it uses a well-defined grammar from which a parser generator creates the parser. It uses well-understood object-oriented design patterns (the Visitor pattern, prominently) to build a processing pipeline that transforms source Wikitext into whatever the desired output format is. And most importantly, it defines an abstract syntax tree (AST), soon to become a document object model (DOM) tree, and works off that tree. We have come a long way from 5000 lines of PHP code.
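To make the Visitor idea concrete, here is a minimal sketch of how one tree can feed several output formats. It is a toy written in Python for brevity, not Sweble's Java API: the node classes (Text, Bold, InternalLink, Paragraph), both visitors, and the /wiki/ URL prefix are hypothetical names chosen for illustration, and the tree is built by hand rather than produced by a generated parser.

```python
import html
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Base class for the (hypothetical) AST node types."""

    def accept(self, visitor):
        # Dispatch to visit_<ClassName> on whichever visitor is passed in.
        return getattr(visitor, "visit_" + type(self).__name__)(self)


@dataclass
class Text(Node):
    value: str


@dataclass
class Bold(Node):
    children: List[Node] = field(default_factory=list)


@dataclass
class InternalLink(Node):
    target: str
    children: List[Node] = field(default_factory=list)


@dataclass
class Paragraph(Node):
    children: List[Node] = field(default_factory=list)


class HtmlRenderer:
    """Visitor that turns the tree into HTML."""

    def _children(self, node):
        return "".join(child.accept(self) for child in node.children)

    def visit_Text(self, node):
        return html.escape(node.value)

    def visit_Bold(self, node):
        return "<b>" + self._children(node) + "</b>"

    def visit_InternalLink(self, node):
        # Assumed URL scheme, for illustration only.
        href = html.escape("/wiki/" + node.target.replace(" ", "_"), quote=True)
        return '<a href="' + href + '">' + self._children(node) + "</a>"

    def visit_Paragraph(self, node):
        return "<p>" + self._children(node) + "</p>"


class PlainTextExtractor:
    """Visitor that keeps only the readable text and drops all markup."""

    def _children(self, node):
        return "".join(child.accept(self) for child in node.children)

    def visit_Text(self, node):
        return node.value

    visit_Bold = visit_InternalLink = visit_Paragraph = _children


# Hand-built tree for the Wikitext: '''Sweble''' parses [[Wikitext]].
tree = Paragraph([
    Bold([Text("Sweble")]),
    Text(" parses "),
    InternalLink("Wikitext", [Text("Wikitext")]),
    Text("."),
])

print(tree.accept(HtmlRenderer()))
print(tree.accept(PlainTextExtractor()))
```

The point of the design is that adding a new output format means writing one more visitor; neither the node classes nor the parser that produced the tree need to change.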

So what does creating an AST and DOM tree for Wikitext buy us?

In short, it buys us interoperability and evolvability. In a 2007 paper, using the then-hopeful wiki markup community standard WikiCreole, we explained the need for such interoperability and evolvability as defined by an open standard. Different tools can gather around that format and evolve independently. Today, everything has to go in lock-step with MediaWiki. By untying the content and data from MediaWiki, we are enabling an ecosystem of tools and technology around the content of Wikipedia (and related projects) so these projects can gain more speed and breadth.

20 thoughts on “The Parser that Cracked the MediaWiki Code”

  1. Dirk Riehle Post author

    @chris – thanks. I think the intent is different: Our work is not (only) about lightweight formatting, but about the whole complexity of MediaWiki’s Wikitext. While I didn’t look at the project in detail, I’d be surprised if it could handle the complexity of Wikipedia pages (templates and tables, that is, multi-level includes, etc.).

  2. bawolff

    It’d be interesting to hear if there are any performance comparisons between this and the MediaWiki parser. Is it faster or slower, and if so, significantly?

    Also, is it intended to have the exact same output as the PHP parser? I noticed some minor differences (I filed a bug). It’d be interesting to know if it passes MediaWiki’s parser tests, or if it is even intended to.

  3. Dirk Riehle Post author

    @andreas Thanks!

    @bawolff Thanks for filing the bug report! It is definitely slower (for now :-) ). Our parser is written in Java, not in PHP. Our goal is to help achieve evolvability of Wikitext again. In the extreme case, our parser has exactly one use: To transform Wikipedia content from the current version of Wikitext into a new version of Wikitext that can be processed more easily (DOM tree and all). For that, we need to get close enough to the output of the current MediaWiki parser (which can’t do this transformation).

  5. Jeroen De Dauw

    Coolness :)

    Is the idea to try to get this into MediaWiki core? I think this will be an extremely hard sell when it’s written in Java, as it simply will not work for many thousands of existing MediaWiki installs. In any case, are you in contact with the WMF about this? As I understand it, they have plans themselves to have the parser rewritten, which I figure is unrelated to this project. Maybe some cooperation/coordination is in order?

  6. Dirk Riehle Post author

    @Jeroen Thanks! We have been in touch with the WMF since I first told Erik in 2009 that my group would implement a parser. Since it is written in Java, we were not assuming that it would become a direct part of MediaWiki. We demoed our parser at the WMF research summit in February this year. Back then, they had not yet started on a new parser. My hope is that there will be multiple parsers for different purposes, but most importantly, that these parsers will be for a new, cleaner, and simpler Wikitext based on an open standard. Sticking to the current Wikitext does not make sense to me.

  7. James

    Congratulations Hannes and Dirk!

    I guess the question is: what percentage of existing Wikipedia pages does this parse “correctly”?

  8. Dirk Riehle Post author

    Hey James, thanks! We didn’t check, but there clearly remains legwork to be done. It wouldn’t be an open source project if we weren’t looking for volunteers to help :-) However, we have made it past the summit and are running downhill.

    Frankly, the Wikitext people write is really fun(ny). Take a look at the Saxby Chambliss page on WP! There is a table split over multiple templates :-)

  9. Dirk Riehle Post author

    @James PS: There is no correct parsing, as your quotation marks imply. The current MediaWiki parser happily creates broken HTML (e.g. an HREF link that crosses table cells) and leaves it to the browser as the final arbiter.

  10. Thomas Luce

    This is really cool, Dirk. I remember hearing something about this before, and it is a great direction to go! We built kiwi (https://github.com/aboutus/kiwi) for slightly different reasons but in a similar vein. We took a different approach, but I love the idea of shifting how you think of the problem as the key to solving it.

    Congrats again!

  11. Dirk Riehle Post author

    Thanks, @Thomas. We met Karl Mathias at the Wikimedia summit, so I believe the Kiwi goal was to render proper HTML? Our goal was nothing like that, though it follows as an option. We really want to get to the content, independent of a particular purpose.

