I am happy to announce that we finally open-sourced the Sweble Wikitext parser. You can find the announcement on my research group’s blog or directly on the Sweble project site. This is the work of Hannes Dohrn, my first Ph.D. student, whom I hired in 2009 to implement a Wikitext parser.
So what about this “cracking the MediaWiki code”?
Wikipedia aims to bring the (encyclopedic) knowledge of the world to all of us, for free. Though already ten years old, the Wikipedia community is just getting started; we have barely seen the tip of the iceberg, and there is so much more to come. All that wonderful content is being written by volunteers using a (seemingly) simple language called Wikitext (the stuff you type in once you click on edit). Until today, Wikitext had been poorly defined.
There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well-defined document object model. That is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of PHP code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there are 30+ failed attempts at writing alternative parsers. That’s why a respected long-time community member asked in exasperation: Can anyone really edit Wikipedia?
The common answer given is to hide the complexity of Wikitext behind a visual editor, but that is not an answer. It doesn’t work: a visual editor, like any other algorithm that wants to work with Wikipedia content, needs a well-understood specification of the language that content is written in. This is where the Sweble parser comes in. Following well-understood computer science best practices, it starts from a well-defined grammar, from which a parser generator creates the parser. It uses well-understood object-oriented design patterns (the Visitor pattern, prominently) to build a processing pipeline that transforms source Wikitext into whatever the desired output format is. And most importantly, it defines an abstract syntax tree (AST), with a document object model (DOM) tree to follow soon, and works off that tree. We have come a long way from 5000 lines of PHP code.
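To make the pipeline idea concrete, here is a minimal sketch of that architecture, not Sweble’s actual API: two hypothetical AST node classes for a tiny Wikitext subset, plus a Visitor that renders the tree as HTML. The point is that the tree is the stable interface; a second Visitor could just as easily emit plain text or feed a visual editor, without touching the parser.

```java
import java.util.List;

interface WtNode {
    <T> T accept(WtVisitor<T> v);
}

// A bold span, e.g. '''text''' in Wikitext (hypothetical node class).
class WtBold implements WtNode {
    final List<WtNode> content;
    WtBold(List<WtNode> content) { this.content = content; }
    public <T> T accept(WtVisitor<T> v) { return v.visit(this); }
}

// A run of plain text (hypothetical node class).
class WtText implements WtNode {
    final String text;
    WtText(String text) { this.text = text; }
    public <T> T accept(WtVisitor<T> v) { return v.visit(this); }
}

// One visit method per node type; a new output format only needs a new visitor.
interface WtVisitor<T> {
    T visit(WtBold node);
    T visit(WtText node);
}

// One possible back end: render the AST as HTML.
class HtmlRenderer implements WtVisitor<String> {
    public String visit(WtBold node) {
        StringBuilder sb = new StringBuilder("<b>");
        for (WtNode child : node.content)
            sb.append(child.accept(this));
        return sb.append("</b>").toString();
    }
    public String visit(WtText node) {
        return node.text; // real code would escape HTML entities here
    }
}

public class Demo {
    public static void main(String[] args) {
        // The AST a parser might produce for the Wikitext: '''Hello''' world
        WtNode ast = new WtBold(List.of(new WtText("Hello")));
        System.out.println(ast.accept(new HtmlRenderer()) + " world");
        // prints: <b>Hello</b> world
    }
}
```

The crucial property is that the 5000 lines of entangled parsing-and-rendering logic are split apart: the grammar produces the tree once, and every tool downstream works off the same well-defined structure.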
So what does creating an AST and DOM tree for Wikitext buy us?
In short, it buys us interoperability and evolvability. In a 2007 paper, using the then-hopeful wiki markup community standard WikiCreole, we explained the need for such interoperability and evolvability as provided by an open standard. Different tools can gather around that format and evolve independently. Today, everything has to go lock-step with MediaWiki. By untying the content and data from MediaWiki, we are enabling an ecosystem of tools and technology around Wikipedia (and related projects’) content, so these projects can gain more speed and breadth.