The Parser that Cracked the MediaWiki Code

I am happy to announce that we finally open sourced the Sweble Wikitext parser. You can find the announcement on the OSR Group blog or directly on the Sweble project site. This is the work of Hannes Dohrn, my first Ph.D. student, whom I hired in 2009 to implement a Wikitext parser.

So what about this “cracking the MediaWiki code”?

Wikipedia aims to bring the (encyclopedic) knowledge of the world to all of us, for free. While already ten years old, the Wikipedia community is just getting started; we have barely seen the tip of the iceberg, and there is so much more to come. All that wonderful content is being written by volunteers using a (seemingly) simple language called Wikitext (the stuff you type in once you click on edit). Until today, Wikitext had been poorly defined.

There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well-defined document object model. That is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of PHP code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there are 30+ failed attempts at writing alternative parsers. That’s why a respected long-time community member asked in exasperation: Can anyone really edit Wikipedia?

The common answer given is to hide the complexity of Wikitext behind a visual editor, but that is not an answer. It doesn’t work: a visual editor, like any other algorithm that wants to work with Wikipedia content, needs a well-understood specification of the language that content is written in. This is where the Sweble parser comes in. Following well-understood computer science best practices, it uses a well-defined grammar from which a parser generator creates the parser. It uses well-understood object-oriented design patterns (the Visitor pattern, prominently) to build a processing pipeline that transforms source Wikitext into whatever the desired output format is. And most importantly, it defines an abstract syntax tree (AST), with a document object model (DOM) tree to follow soon, and works off that tree. We have come a long way from 5000 lines of PHP code.
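
To make the pipeline idea concrete, here is a minimal sketch of the Visitor approach in Java (the language the parser is written in). The node and visitor classes below are invented for illustration and are not Sweble’s actual API: a toy AST for a small Wikitext fragment and one visitor that renders it to HTML. A second visitor could just as well emit plain text, or a cleaned-up Wikitext dialect, from the same tree.

```java
// Illustrative only: a toy Wikitext AST plus a visitor that renders it to HTML.
// These classes are NOT Sweble's API; they merely demonstrate the pattern.
import java.util.List;

interface WikiNode {
    <T> T accept(WikiVisitor<T> v);
}

record Text(String value) implements WikiNode {
    public <T> T accept(WikiVisitor<T> v) { return v.visit(this); }
}

record Bold(List<WikiNode> children) implements WikiNode {
    public <T> T accept(WikiVisitor<T> v) { return v.visit(this); }
}

record InternalLink(String target) implements WikiNode {
    public <T> T accept(WikiVisitor<T> v) { return v.visit(this); }
}

interface WikiVisitor<T> {
    T visit(Text node);
    T visit(Bold node);
    T visit(InternalLink node);
}

// One concrete pipeline stage: AST -> HTML. Other visitors would implement
// other output formats without touching the parser or the tree classes.
class HtmlRenderer implements WikiVisitor<String> {
    public String visit(Text node) { return node.value(); }
    public String visit(Bold node) {
        StringBuilder sb = new StringBuilder("<b>");
        for (WikiNode child : node.children()) sb.append(child.accept(this));
        return sb.append("</b>").toString();
    }
    public String visit(InternalLink node) {
        return "<a href=\"/wiki/" + node.target() + "\">" + node.target() + "</a>";
    }
}

public class VisitorSketch {
    public static void main(String[] args) {
        // The tree a parser might produce for: '''[[Wikipedia]] is free'''
        WikiNode ast = new Bold(List.of(new InternalLink("Wikipedia"), new Text(" is free")));
        System.out.println(ast.accept(new HtmlRenderer()));
        // prints: <b><a href="/wiki/Wikipedia">Wikipedia</a> is free</b>
    }
}
```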

So what does creating an AST and DOM tree for Wikitext buy us?

In short, it buys us interoperability and evolvability. In a 2007 paper, using the then hopeful wiki markup community standard WikiCreole, we explained the need for such interoperability and evolvability as defined by an open standard. Different tools can gather around that format and evolve independently. Today, everything has to go in lock-step with MediaWiki. By untying the content and data from MediaWiki, we are enabling an ecosystem of tools and technology around Wikipedia (and related projects’) content so these projects can gain more speed and breadth.

20 thoughts on “The Parser that Cracked the MediaWiki Code”

  1. Dirk Riehle (post author)

    @chris, thanks. I think the intent is different: our work is not (only) about lightweight formatting, but about the whole complexity of MediaWiki’s Wikitext. While I didn’t look at the project in detail, I’d be surprised if it could handle the complexity of Wikipedia pages (templates and tables, multi-level includes, etc.).

  2. bawolff

    It’d be interesting to hear if there are any performance comparisons between this and the MediaWiki parser. Is it faster or slower, and if so, is it significantly faster or slower?

    Also, is it intended to have the exact same output as the PHP parser? I noticed some minor differences (I filed a bug). It’d be interesting to know if it passes MediaWiki’s parser tests, or if it is even intended to.

  3. Dirk Riehle (post author)

    @andreas Thanks!

    @bawolff Thanks for filing the bug report! It is definitely slower (for now 🙂 ). Our parser is written in Java, not in PHP. Our goal is to help achieve evolvability of Wikitext again. In the extreme case, our parser has exactly one use: to transform Wikipedia content from the current version of Wikitext into a new version of Wikitext that can be processed more easily (DOM tree and all). For that, we need to get close enough to the output of the current MediaWiki parser (which can’t do this transformation).

  4. Pingback: links for 2011-05-02 « Wild Webmink

  5. Jeroen De Dauw

    Coolness 🙂

    Is the idea to try to get this into MediaWiki core? I think this will be an extremely hard sell when it’s written in Java, as it simply will not work for many thousands of existing MediaWiki installs. In any case, are you in contact with the WMF about this? As I understand it, they have plans themselves to have the parser rewritten, which I figure is unrelated to this project. Maybe some cooperation/coordination is in order?

  6. Dirk Riehle (post author)

    @Jeroen Thanks! We have been in touch with the WMF since I first told Erik in 2009 that my group would implement a parser. Since it is written in Java, we were not assuming that it would become a direct part of MediaWiki. We demoed our parser at the WMF research summit in February this year; back then they had not yet started on a new parser. My hope is that there will be multiple parsers for different purposes, but most importantly, that these parsers will be for a new, cleaner, and simpler Wikitext based on an open standard. Sticking to the current Wikitext does not make sense to me.

  7. James

    Congratulations Hannes and Dirk!

    I guess the question is: what percentage of existing Wikipedia pages does this parse “correctly”?

  8. Dirk Riehle (post author)

    Hey James, thanks! We didn’t check, but there clearly remains legwork to be done. It wouldn’t be an open source project if we weren’t looking for volunteers to help 🙂 However, we have made it past the summit and are running downhill.

    Frankly, the Wikitext people write is really fun(ny). Take a look at the Saxby Chambliss page on WP! There is a table split over multiple templates 🙂

  9. Dirk Riehle (post author)

    @James PS: There is no correct parsing, as your quotation marks imply. The current MediaWiki parser happily creates broken HTML (e.g. an HREF link that crosses table cells) and leaves it to the browser as the final arbiter.

  10. Thomas Luce

    This is really cool, Dirk. I remember hearing something about this before, and it’s a great direction to go! We built kiwi (https://github.com/aboutus/kiwi) for slightly different reasons but in a similar vein. We took a different approach, but I love the idea of shifting how you think of the problem as the key to solving it.

    Congrats again!

  11. Dirk Riehle (post author)

    Thanks, @Thomas. We met Karl Mathias at the Wikimedia summit, so I believe the Kiwi goal was to render proper HTML? Our goal was nothing like that, though it follows as an option. We really want to get to the content, independent of a particular purpose.

  12. Pingback: Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9 « Byte Mining
