The Parser that Cracked the MediaWiki Code

I am happy to announce that we finally open sourced the Sweble Wikitext parser. You can find the announcement on the OSR Group blog or directly on the Sweble project site. This is the work of Hannes Dohrn, my first Ph.D. student, whom I hired in 2009 to implement a Wikitext parser.

So what about this “cracking the MediaWiki code”?

Wikipedia aims to bring the (encyclopedic) knowledge of the world to all of us, for free. While already ten years old, the Wikipedia community is just getting started; we have barely seen the tip of the iceberg, and there is so much more to come. All that wonderful content is being written by volunteers using a (seemingly) simple language called Wikitext (the stuff you type in once you click on edit). Until today, Wikitext had been poorly defined.

There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well-defined document object model. That is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of PHP code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there are 30+ failed attempts at writing alternative parsers. That’s why a respected long-time community member asked in exasperation: Can anyone really edit Wikipedia?

The common answer given is to hide the complexity of Wikitext behind a visual editor, but that is not an answer. It doesn’t work: A visual editor, like any other algorithm that wants to work with Wikipedia content, needs a well-understood specification of the language that content is written in. This is where the Sweble parser comes in. Following well-understood computer science best practices, it uses a well-defined grammar that a parser generator uses to create a parser. It uses well-understood object-oriented design patterns (the Visitor pattern, prominently) to build a processing pipeline that transforms source Wikitext into whatever the desired output format is. And most importantly, it defines an abstract syntax tree (AST), document object model (DOM) tree soon, and works off that tree. We have come a long way from 5000 lines of PHP code.
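To make the Visitor-based pipeline idea concrete, here is a minimal sketch in Java. All names (`Node`, `Text`, `Bold`, `HtmlRenderer`) are illustrative assumptions for this post, not Sweble’s actual API: a tiny AST and one interchangeable back end that renders it to HTML.

```java
// Hypothetical sketch of a visitor-based rendering pipeline.
// The node and visitor names are illustrative, not Sweble's actual API.

interface Node {
    <T> T accept(Visitor<T> v);
}

// Two toy AST node types: plain text and bold markup ('''…''' in Wikitext).
record Text(String value) implements Node {
    public <T> T accept(Visitor<T> v) { return v.visit(this); }
}

record Bold(Node child) implements Node {
    public <T> T accept(Visitor<T> v) { return v.visit(this); }
}

interface Visitor<T> {
    T visit(Text t);
    T visit(Bold b);
}

// One possible back end among many: render the AST to HTML.
// A different Visitor could emit plain text, LaTeX, or a cleaned-up Wikitext.
class HtmlRenderer implements Visitor<String> {
    public String visit(Text t) { return t.value(); }
    public String visit(Bold b) { return "<b>" + b.child().accept(this) + "</b>"; }
}

public class VisitorSketch {
    public static void main(String[] args) {
        Node ast = new Bold(new Text("Wikipedia"));
        System.out.println(ast.accept(new HtmlRenderer())); // prints <b>Wikipedia</b>
    }
}
```

The point of the pattern is that the tree is fixed while the output format is pluggable: adding a new target format means adding a new `Visitor` implementation, without touching the parser or the AST classes.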

So what does creating an AST and DOM tree for Wikitext buy us?

In short, it buys us interoperability and evolvability. In a 2007 paper, using the then hopeful wiki markup community standard WikiCreole, we explained the need for such interoperability and evolvability as defined by an open standard. Different tools can gather around that format and evolve independently. Today, everything has to go lock-step with MediaWiki. By untying the content and data from MediaWiki, we are enabling an ecosystem of tools and technology around Wikipedia (and related projects’) content so these projects can gain more speed and breadth.

20 thoughts on “The Parser that Cracked the MediaWiki Code”

  1. Dirk Riehle Post author

    @chris — thanks. I think the intent is different: Our work is not (only) about lightweight formatting, but about the whole complexity of MediaWiki’s Wikitext. While I didn’t look at the project in detail, I’d be surprised if it could handle the complexity of Wikipedia pages (templates and tables, that is, multi-level includes, etc.).

  2. bawolff

    It’d be interesting to hear if there are any performance comparisons between this and the MediaWiki parser. Is it faster or slower, and if so, significantly so?

    Also, is it intended to have the exact same output as the PHP parser? I noticed some minor differences (I filed a bug). It’d be interesting to know if it passes MediaWiki’s parser tests, or if it is even intended to.

  3. Dirk Riehle Post author

    @andreas Thanks!

    @bawolff Thanks for filing the bug report! It is definitely slower (for now 🙂 ). Our parser is written in Java, not in PHP. Our goal is to help achieve evolvability of Wikitext again. In the extreme case, our parser has exactly one use: To transform Wikipedia content from the current version of Wikitext into a new version of Wikitext that can be processed more easily (DOM tree and all). For that, we need to get close enough to the output of the current MediaWiki parser (which can’t do this transformation).

  5. Jeroen De Dauw

    Coolness 🙂

    Is the idea to try to get this into MediaWiki core? I think this will be an extremely hard sell when it’s written in Java, as it simply will not work for many thousands of existing MediaWiki installs. In any case, are you in contact with the WMF about this? As I understand it, they have plans themselves to have the parser rewritten, which I figure is unrelated to this project. Maybe some cooperation/coordination is in order?

  6. Dirk Riehle Post author

    @Jeroen Thanks! We have been in touch with the WMF since I first declared to Erik in 2009 that my group would implement a parser. Since it is written in Java, we were not assuming that it would become a direct part of MediaWiki. We demoed our parser at the WMF research summit in February this year — back then they had not yet started on a new parser. My hope is that there will be multiple parsers for different purposes, but most importantly, that these parsers will be for a new, cleaner, and simpler Wikitext based on an open standard. Sticking to the current Wikitext does not make sense to me.

  7. James

    Congratulations Hannes and Dirk!

    I guess the question is: what percentage of existing Wikipedia pages does this parse “correctly”?

  8. Dirk Riehle Post author

    Hey James, thanks! We didn’t check, but there clearly remains legwork to be done. It wouldn’t be an open source project if we weren’t looking for volunteers to help 🙂 However, we have made it past the summit and are running downhill.

    Frankly, the Wikitext people write is really fun(ny). Take a look at the Saxby Chambliss page on WP! There is a table split over multiple templates 🙂

  9. Dirk Riehle Post author

    @James PS: There is no correct parsing, as your quotation marks imply. The current MediaWiki parser happily creates broken HTML (e.g. an HREF link that crosses table cells) and leaves it to the browser as the final arbiter.

  10. Thomas Luce

    This is really cool, Dirk. I remember hearing something about this before, and it is a great direction to go! We built kiwi for slightly different reasons but in a similar vein. We took a different approach, but I love the idea of shifting how you think of the problem as the key to solving it.

    Congrats again!

  11. Dirk Riehle Post author

    Thanks, @Thomas. We met Karl Mathias at the Wikimedia summit, so I believe the Kiwi goal was to render proper HTML? Our goal was nothing like that, though it follows as an option. We really want to get to the content, independent of a particular purpose.
