WOM: An Object Model for Wikitext [Technical Report]

Abstract: Wikipedia is a rich encyclopedia that is of great use not only to its contributors and readers but also to researchers and providers of third-party software around Wikipedia. However, Wikipedia’s content is only available as Wikitext, the markup language in which articles on Wikipedia are written, and whoever needs to access the content of an article has to implement their own parser or use one of the available parser solutions. Unfortunately, those parsers that convert Wikitext into a high-level representation like an abstract syntax tree (AST) each define their own format for storing and providing access to this data structure. Further, the semantics of Wikitext are only defined implicitly in the MediaWiki software itself. This situation makes it difficult to reason about the semantic content of an article or to exchange and modify articles in a standardized and machine-accessible way. To remedy this situation, we propose a markup language, called XWML, in which articles can be stored, and an object model, called WOM, that defines how the contents of an article can be read and modified.

Keywords: Wiki, Wikipedia, Wikitext, Wikitext Parser, Open Source, Sweble, MediaWiki, MediaWiki Parser, XWML, HTML, WOM

Reference: Hannes Dohrn and Dirk Riehle. WOM: An Object Model for Wikitext. University of Erlangen, Technical Report CS-2011-05 (July 2011).

The technical report is available as a PDF file.

The Parser That Cracked The MediaWiki Code

I am happy to announce that we finally open-sourced the Sweble Wikitext parser. You can find the announcement on my research group’s blog or directly on the Sweble project site. This is the work of Hannes Dohrn, my first Ph.D. student, whom I hired in 2009 to implement a Wikitext parser.

So what about this “cracking the MediaWiki code”?

Wikipedia aims to bring the (encyclopedic) knowledge of the world to all of us, for free. While already ten years old, the Wikipedia community is just getting started, and we have barely seen the tip of the iceberg; there is so much more to come. All that wonderful content is being written by volunteers using a (seemingly) simple language called Wikitext (the stuff you type in once you click on edit). Until today, Wikitext had been poorly defined.

Revamping German Copyright Law #EIDG

The German Enquete commission “Internet and Digital Society” is a multilateral commission instituted by the German parliament to discuss and make recommendations on, well, Internet and digital society. I’m a member of an expert advisory council for one of the parties involved in the commission. I received the following catalog of questions and thought I’d share the questions here and maybe we can have a good discussion. For international readers, it may be helpful to read Wikipedia on German copyright law. So, here are the questions.

My Position on Privacy (Seven Things About Me)

Stormy Peters recently tagged me to post seven items about my life. This is a “viral” pyramid scheme; you are supposed to write these seven items and then tag seven other people to do the same. It is not the first time I got such a request; I also got tagged on Facebook to post 25 items about my life, and in general it is quite tempting to let your personal thoughts hang out on a blog like this.

I usually ignore such requests for reasons of privacy. Everything you do or say on the Internet can be used at some future point in time. The saying “on the Internet, nobody knows you are a dog” is completely wrong; on the Internet, anyone with enough resources can not only know you are a dog but can also know everything about you down to hereditary diseases, even things you may not know yourself. Or, as Scott McNealy is famous for saying: “You have no privacy. Get over it.”

Here, then, are seven things about my take on privacy in the Internet age:

Learning from Wikipedia: Open Collaboration within Corporations

Wikipedia is the free online encyclopedia that has taken the Internet by storm. It is written and administered solely by volunteers. How exactly did this come about and how does it work? Can it keep working? And maybe more importantly, can you transfer its practices to the workplace to achieve similar levels of dedication and quality of work? In this presentation I describe the structure, processes and governance of Wikipedia and discuss how some of its practices can be transferred to the corporate context.

This presentation represents the next step in the evolution of two Wikimania tutorials/workshops; see Presentations/Tutorials.

Reference: Dirk Riehle. “Learning from Wikipedia: Open Collaboration within Corporations.” Invited talk at Talk the Future 2008. Krems, Austria: 2008.

The slides are available as a PDF file.

A Grammar for Standardized Wiki Markup

Authors: Martin Junghans, Dirk Riehle, Rama Gurram, Matthias Kaiser, Mário Lopes, Umit Yalcinalp

Abstract: Today’s wiki engines are not interoperable. The rendering engine is tied to the processing tools which are tied to the wiki editors. This is an unfortunate consequence of the lack of rigorously specified standards. This paper discusses an EBNF-based grammar for Wiki Creole 1.0, a community standard for wiki markup, and demonstrates its benefits. Wiki Creole is being specified using prose, so our grammar revealed several categories of ambiguities, showing the value of a more formal approach to wiki markup specification. The formalization of Wiki Creole using a grammar shows performance problems that today’s regular-expression-based wiki parsers might face when scaling up. We present an implementation of a wiki markup parser and demonstrate our test cases for validating Wiki Creole parsers. We view the work presented in this paper as an important step towards decoupling wiki rendering engines from processing tools and from editing tools by means of a precise and complete wiki markup specification. This decoupling layer will then allow innovation on these different parts to proceed independently and, as expected, at a faster pace than before.

Reference: In Proceedings of the 2008 International Symposium on Wikis (WikiSym ’08). ACM Press, 2008: Article No. 21.

Available as a PDF file.

Bringing Wikipedia to Work: Open Collaboration within Corporations

This upcoming Wikimania 2008 tutorial discusses the three principles of “open collaboration” which I believe underlie wikis, open source, and other forms of peer production. It is a follow-up to last year’s tutorial about open collaboration at Wikimania 2007.

Reference: Dirk Riehle. “Bringing Wikipedia to Work: Open Collaboration in Corporations.” In Proceedings of Wikimania 2008, forthcoming.

Also available as a PDF file.

Wiki Creole Grammar, Schema, Transformations Made Available

For wiki research purposes as well as the Wiki Creole community’s convenience, we are making our EBNF grammar, the XML schema definition, and the to/from XML transformations available. You can use these specifications to create your own wiki parsers (using parser generators) as well as use standard technology (DOM, XSLT) to work with wiki pages and display or save them.

For more, see the dedicated wiki-creole page.

An XML Interchange Format for Wiki Creole 1.0 [Technical Report]

Abstract: Wikis have become an important application on the web and in the enterprise, yet there are no interoperability standards between different wiki engines. We present the first complete XML representation format of Wiki Creole 1.0. Wiki Creole is a community standard for wiki markup, the language used to write wiki pages. This report presents the complete XML representation format using a validating XML schema. In addition, we present XSLT definitions for transforming the XML representations to XHTML on the one hand and to Wiki Creole markup on the other hand. Our work shows how, using XML technologies, we can make wiki interchange, wiki upgrading, and wiki conversion independent from a specific wiki engine implementation.
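
To give a feel for the two-way transformation the abstract describes, here is a minimal sketch in Python rather than XSLT. The element names (`page`, `heading`, `paragraph`, `bold`, `italic`) are made up for illustration and are not taken from the report's schema; only the Creole markers (`=` for headings, `**` for bold, `//` for italics) come from Wiki Creole 1.0.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML vocabulary; the schema in the report may look different.
SAMPLE = (
    '<page>'
    '<heading level="2">Intro</heading>'
    '<paragraph>Wikis are <bold>open</bold> and <italic>simple</italic>.</paragraph>'
    '</page>'
)

INLINE_CREOLE = {"bold": "**", "italic": "//"}
INLINE_XHTML = {"bold": "strong", "italic": "em"}


def inline_to_creole(node):
    """Serialize the mixed content of a node as Creole inline markup."""
    parts = [node.text or ""]
    for child in node:
        mark = INLINE_CREOLE[child.tag]
        parts.append(mark + inline_to_creole(child) + mark)
        parts.append(child.tail or "")
    return "".join(parts)


def to_creole(xml_text):
    """Turn the XML representation back into Creole markup."""
    blocks = []
    for block in ET.fromstring(xml_text):
        body = inline_to_creole(block)
        if block.tag == "heading":
            marks = "=" * int(block.get("level", "1"))
            blocks.append(f"{marks} {body} {marks}")
        else:
            blocks.append(body)
    return "\n\n".join(blocks)


def inline_to_xhtml(node, target):
    """Copy mixed content from node into target, renaming inline elements."""
    target.text = node.text
    for child in node:
        out = ET.SubElement(target, INLINE_XHTML[child.tag])
        inline_to_xhtml(child, out)
        out.tail = child.tail


def to_xhtml(xml_text):
    """Render the same XML representation as an XHTML body fragment."""
    body = ET.Element("body")
    for block in ET.fromstring(xml_text):
        tag = f"h{block.get('level', '1')}" if block.tag == "heading" else "p"
        inline_to_xhtml(block, ET.SubElement(body, tag))
    return ET.tostring(body, encoding="unicode")
```

Calling `to_creole(SAMPLE)` and `to_xhtml(SAMPLE)` produces both output formats from the same input, which is the point of an engine-independent interchange format. The sketch raises a `KeyError` on any inline element it does not know.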

Reference: Martin Junghans, Dirk Riehle, Umit Yalcinalp. In ACM SIGWEB Newsletter, Volume 2007, Issue Winter (Winter 2007), Article No. 5. ACM Press, 2007.

Available as a PDF file.

An EBNF Grammar for Wiki Creole 1.0 [Technical Report]

Abstract: Today’s wiki engines are not interoperable. This is an unfortunate consequence of the lack of rigorously specified standards. This technical report presents a complete and validated EBNF-based grammar for Wiki Creole, a community standard for wiki markup. Wiki Creole is also the only standard currently available. Wiki Creole is being specified using prose, leading to inconsistencies and ambiguities. Our grammar uncovered those ambiguities, which we fed back into the specification process. The Wiki Creole grammar presented in this report makes the creation of Wiki Creole parsers simple using parser generators, ANTLR in our case. Using a precise specification of wiki markup lets us decouple wiki editors from wiki storage from further wiki processing tools. Based on this decoupling layer, we expect innovation on these different parts to proceed independently and at a faster pace than before.
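
As a rough illustration of how grammar rules map onto a parser, here is a small hand-written recursive-descent sketch in Python for two Creole inline rules, bold (`**`) and italics (`//`). It is not the grammar from the report and it does not use ANTLR; the function name, the node shape, and the decision to close unterminated markup at the end of the input are assumptions made for this example.

```python
# Toy grammar covered by this sketch (not the report's grammar):
#   inline = { bold | italic | plain } ;
#   bold   = "**" , inline , "**" ;
#   italic = "//" , inline , "//" ;
MARKS = {"**": "bold", "//": "italic"}


def parse_inline(src, pos=0, closing=None):
    """Parse src[pos:] up to the `closing` marker or the end of the input.

    Returns (nodes, next_pos). A node is a str or a (kind, children) tuple.
    """
    nodes, buf = [], []

    def flush():
        # Emit the plain text collected so far as a single node.
        if buf:
            nodes.append("".join(buf))
            buf.clear()

    while pos < len(src):
        two = src[pos:pos + 2]
        if closing is not None and two == closing:
            flush()
            return nodes, pos + 2
        if two in MARKS:
            flush()
            children, pos = parse_inline(src, pos + 2, two)
            nodes.append((MARKS[two], children))
        else:
            buf.append(src[pos])
            pos += 1
    flush()  # Assumption: unterminated markup ends at the end of the input.
    return nodes, pos
```

For example, `parse_inline("a **b //c// d** e")[0]` yields a tree with an italic node nested inside a bold node. The sketch deliberately ignores harder cases such as `//` inside a URL, which are the kind of ambiguity a prose specification tends to leave open.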

Reference: Martin Junghans, Dirk Riehle, Rama Gurram, Matthias Kaiser, Mario Lopes, Umit Yalcinalp. In ACM SIGWEB Newsletter, Volume 2007, Issue Winter (Winter 2007), Article No. 4. ACM Press, 2007.

Available as a PDF file.