Dirk Riehle's Industry and Research Publications

The Parser That Cracked The MediaWiki Code

I am happy to announce that we have finally open sourced the Sweble Wikitext parser. You can find the announcement on my research group’s blog or directly on the Sweble project site. This is the work of Hannes Dohrn, my first Ph.D. student, whom I hired in 2009 to implement a Wikitext parser.

So what about this “cracking the MediaWiki code”?

Wikipedia aims to bring the (encyclopedic) knowledge of the world to all of us, for free. While already ten years old, the Wikipedia community is just getting started; we have barely seen the tip of the iceberg, and there is so much more to come. All that wonderful content is written by volunteers using a (seemingly) simple language called Wikitext (the stuff you type once you click edit). Until today, Wikitext had been poorly defined.

There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well-defined document object model. That is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of PHP code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there are 30+ failed attempts at writing alternative parsers. That’s why a respected long-time community member asked in exasperation: can anyone really edit Wikipedia?

The common answer given is to hide the complexity of Wikitext behind a visual editor, but that is not an answer. It doesn’t work: a visual editor, like any other algorithm that wants to work with Wikipedia content, needs a well-understood specification of the language that content is written in. This is where the Sweble parser comes in. Following well-understood computer science best practices, it uses a well-defined grammar from which a parser generator creates the parser. It uses well-understood object-oriented design patterns (most prominently the Visitor pattern) to build a processing pipeline that transforms source Wikitext into whatever the desired output format is. And most importantly, it defines an abstract syntax tree (AST), with a document object model (DOM) tree to follow soon, and works off that tree. We have come a long way from 5000 lines of PHP code.
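To make the AST-plus-Visitor pipeline concrete, here is a minimal sketch. This is a hypothetical illustration, not the actual Sweble API: a tiny two-node AST and a single HTML-rendering pass. A real pipeline would cover many more node types (templates, tables, links, and so on) and many more passes, but the shape is the same: the parser builds the tree, and each output format is just another visitor over it.

```java
// Minimal sketch of the AST + Visitor approach (hypothetical, not the Sweble API).

// A node of the abstract syntax tree.
interface AstNode {
    <T> T accept(AstVisitor<T> v);
}

// Plain text content.
final class Text implements AstNode {
    final String content;
    Text(String content) { this.content = content; }
    public <T> T accept(AstVisitor<T> v) { return v.visit(this); }
}

// Bold markup ('''...''' in Wikitext) wrapping a child node.
final class Bold implements AstNode {
    final AstNode child;
    Bold(AstNode child) { this.child = child; }
    public <T> T accept(AstVisitor<T> v) { return v.visit(this); }
}

// One visitor per output format; each pass evolves independently of the others.
interface AstVisitor<T> {
    T visit(Text node);
    T visit(Bold node);
}

// A concrete pass: render the tree as HTML.
final class HtmlRenderer implements AstVisitor<String> {
    public String visit(Text node) { return node.content; }
    public String visit(Bold node) {
        return "<b>" + node.child.accept(this) + "</b>";
    }
}

public class VisitorSketch {
    public static void main(String[] args) {
        // The AST a parser might produce for the Wikitext '''hello'''
        AstNode ast = new Bold(new Text("hello"));
        System.out.println(ast.accept(new HtmlRenderer())); // prints <b>hello</b>
    }
}
```

Adding another output format (say, plain text or a new Wikitext dialect) means writing another `AstVisitor` implementation, without touching the parser or the tree.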

So what does creating an AST and DOM tree for Wikitext buy us?

In short, it buys us interoperability and evolvability. In a 2007 paper, using the then-hopeful wiki markup community standard WikiCreole, we explained the need for such interoperability and evolvability as provided by an open standard. Different tools can gather around that format and evolve independently. Today, everything has to move in lock-step with MediaWiki. By untying the content and data from MediaWiki, we are enabling an ecosystem of tools and technology around Wikipedia (and related projects’) content so these projects can gain more speed and breadth.

Comments

  1. […] formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehl […]

  4. […] formatting is generated by a somewhat unusual markup format defined by the MediaWiki project. As Dirk Riehl stated: There was no grammar, no defined processing rules, and no defined output like a DOM tree […]

  5. raja

    ponga da i want to learn more wiki markup code pls help me

  7. Dirk Riehle

    Thanks, @Thomas. We met Karl Mathias at the Wikimedia summit; as I understood it, the Kiwi goal was to render proper HTML? Our goal was nothing like that, though it follows as an option. We really want to get to the content, independent of a particular purpose.

  8. Thomas Luce

    This is really cool, Dirk. I remember hearing something about this before, and it’s a great direction to go! We built Kiwi (https://github.com/aboutus/kiwi) for slightly different reasons but in a similar vein. We took a different approach, but I love the idea of shifting how you think of the problem as the key to solving it.
    Congrats again!

  9. Dirk Riehle

    @James PS: There is no correct parsing, as your quotation marks imply. The current MediaWiki parser happily creates broken HTML (e.g. an HREF link that crosses table cells) and leaves it to the browser as the final arbiter.

  10. Dirk Riehle

    Hey James, thanks! We didn’t check, but there clearly remains legwork to be done. It wouldn’t be an open source project if we weren’t looking for volunteers to help 🙂 However, we have made it past the summit and are running downhill.
    Frankly, the Wikitext people write is really fun(ny). Take a look at the Saxby Chambliss page on WP! There is a table split over multiple templates 🙂

  11. James

    Congratulations Hannes and Dirk!
    I guess the question is: what percentage of existing Wikipedia pages does this parse “correctly”?

  12. Dirk Riehle

    @Jeroen Thanks! We have been in touch with the WMF since I first told Erik in 2009 that my group would implement a parser. Since it is written in Java, we were not assuming that it would become a direct part of MediaWiki. We demoed our parser at the WMF research summit in February this year – back then they had not yet started on a new parser. My hope is that there will be multiple parsers for different purposes, but most importantly, that these parsers will be for a new, cleaner, and simpler Wikitext based on an open standard. Sticking to the current Wikitext does not make sense to me.

  13. Jeroen De Dauw

    Coolness 🙂
    Is the idea to try to get this into MediaWiki core? I think this will be an extremely hard sell when it’s written in Java, as it simply won’t work for the many thousands of existing MediaWiki installs. In any case, are you in contact with the WMF about this? As I understand it, they have plans themselves to have the parser rewritten, which I figure is unrelated to this project. Maybe some cooperation/coordination is in order?

  14. […] The Parser that Cracked the MediaWiki Code | Software Research and the Industry As many of us have been commenting for many years, the optimum freedom comes from neither open standards nor from open source alone but from the combination of the two. The wiki markup language used by MediaWiki is expressed only as open source code; this notable and valuable effort seeks to codify it in a way that makes it possible to repurpose wiki content in the future programmatically. This kind of activity is enormously important culturally. (tags: Standards OpenSource Wikipedia MediaWiki Wiki Parser) […]

  15. Dirk Riehle

    @andreas Thanks!
    @bawolff Thanks for filing the bug report! It is definitely slower (for now 🙂). Our parser is written in Java, not in PHP. Our goal is to help make Wikitext evolvable again. In the extreme case, our parser has exactly one use: to transform Wikipedia content from the current version of Wikitext into a new version of Wikitext that can be processed more easily (DOM tree and all). For that, we need to get close enough to the output of the current MediaWiki parser (which can’t do this transformation).

  16. bawolff

    It’d be interesting to hear if there are any performance comparisons between this and the MediaWiki parser. Is it faster or slower, and if so, significantly so?
    Also, is it intended to have the exact same output as the PHP parser? I noticed some minor differences (I filed a bug). It’d be interesting to know whether it passes MediaWiki’s parser tests, or whether it is even intended to.

  17. Andreas Kuckartz
  18. Dirk Riehle

    @chris – thanks. I think the intent is different: our work is not (only) about lightweight formatting, but about the whole complexity of MediaWiki’s Wikitext. While I didn’t look at the project in detail, I’d be surprised if it could handle the complexity of Wikipedia pages (templates and tables, that is, multi-level includes, etc.).

  19. Chris Aniszczyk

    Have you looked at some of the work over at the Eclipse Mylyn Wikitext project?
    http://wiki.eclipse.org/Mylyn/Incubator/WikiText
    It’s a pretty mature project that works with a variety of wiki markup languages out there, along with a bunch of converters. It’s all available under the EPL 1.0.

  20. Dirk Riehle

    Great! Let us know how it goes and if we can be of help over on http://sweble.org

  21. James Michael DuPont

    I am very happy to hear about this, I will be trying it out.
    mike
