Some Progress on Wikipedia Editing

Wikipedia has long suffered from its rather raw “wiki markup” editing experience. The reason is that the underlying software is stuck in the mud, and any progress is slow and painful. Right now there is some excitement over progress on the “visual editor” of MediaWiki. As you can see in the video below, the look and feel is 2016, while the functionality is still 1999. How we will catch up with Google Docs or Medium or any reasonable editing experience this way remains a mystery to me.

World Views Are Not Data Inconsistencies

I’m at Wikimania 2013, listening in on the WikiData session. WikiData is the Wikimedia Foundation’s attempt to go beyond prose in Wikipedia pages and provide a reference data source. An obvious problem is that any such data source needs an underlying model of the world, and that gaining consensus on that model is sometimes not just hard but impossible. Basically, different world views are simply incompatible. When asked about this fundamental problem, the audience was told that such inconsistencies are handled using multi-valued properties. Ignoring for a second that world views cannot be reduced to individual properties, my major point here is that world views are not inconsistencies in the data. Different world views are real and justified, and there will never be only one world view. The moment we all agree on one world view, we have become the Borg.

Update: Daniel Kinzler corrected me that this must be a misunderstanding: WikiData can handle multiple world views well by way of multi-valued properties.
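
To make the idea of a multi-valued property a little more concrete, here is a minimal Python sketch. It is an illustration only: the class names (`Claim`, `Item`), the `source` field, and the sample values are hypothetical and are not taken from WikiData's actual data model or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Claim:
    """One value for a property, together with who asserts it."""
    value: str
    source: str  # hypothetical field: the party holding this view


@dataclass
class Item:
    """An entity whose properties may each hold several claims."""
    label: str
    claims: dict[str, list[Claim]] = field(default_factory=dict)

    def add_claim(self, prop: str, value: str, source: str) -> None:
        # Append rather than overwrite, so differing views coexist.
        self.claims.setdefault(prop, []).append(Claim(value, source))

    def values(self, prop: str) -> list[str]:
        # All values recorded for a property, in insertion order.
        return [c.value for c in self.claims.get(prop, [])]


# Invented sample data: two parties disagree about the same property.
item = Item("Example territory")
item.add_claim("part of", "Country A", source="View 1")
item.add_claim("part of", "Country B", source="View 2")
print(item.values("part of"))
```

In this sketch neither value is treated as an error; both are stored side by side with their source. Whether a set of such per-property values adds up to a world view is exactly the question raised above.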

Design and Implementation of Wiki Content Transformations and Refactorings

Abstract: The organic growth of wikis requires constant attention by contributors who are willing to patrol the wiki and improve its content structure. However, most wikis still only offer textual editing and even wikis which offer WYSIWYG editing do not assist the user in restructuring the wiki. Therefore, “gardening” a wiki is a tedious and error-prone task. One of the main obstacles to assisted restructuring of wikis is the underlying content model which prohibits automatic transformations of the content. Most wikis use either a purely textual representation of content or rely on the representational HTML format. To allow rigorous definitions of transformations we use and extend a Wiki Object Model. With the Wiki Object Model installed we present a catalog of transformations and refactorings that helps users to easily and consistently evolve the content and structure of a wiki. Furthermore we propose XSLT as language for transformation specification and provide working examples of selected transformations to demonstrate that the Wiki Object Model and the transformation framework are well designed. We believe that our contribution significantly simplifies wiki “gardening” by introducing the means of effortless restructuring of articles and groups of articles. It furthermore provides an easily extensible foundation for wiki content transformations.

Keywords: Wiki, Wiki Markup, WM, Wiki Object Model, WOM, Transformation, Refactoring, XML, XSLT, Sweble

Reference: Hannes Dohrn, Dirk Riehle. “Design and Implementation of Wiki Content Transformations and Refactorings.” In Proceedings of the 9th International Symposium on Open Collaboration (WikiSym + OpenSym 2013). ACM, 2013.

The paper is available as a PDF file.
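
The kind of restructuring the abstract describes can be sketched roughly as follows. This is not the paper's implementation: the paper proposes XSLT over the Wiki Object Model, whereas this sketch uses Python's standard `xml.etree.ElementTree`, and the element and attribute names (`article`, `link`, `target`) are invented for illustration and are not the real WOM schema. It shows one simple refactoring, retargeting links after a page is renamed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page tree; NOT the real WOM schema.
PAGE = """
<article title="Gardening">
  <section><heading>Tools</heading>
    <p>See <link target="Spade">spades</link> and <link target="Rake">rakes</link>.</p>
  </section>
</article>
"""


def retarget_links(root, old, new):
    """Point every link at `old` to `new`; return the number changed."""
    changed = 0
    for link in root.iter("link"):
        if link.get("target") == old:
            link.set("target", new)
            changed += 1
    return changed


root = ET.fromstring(PAGE)
count = retarget_links(root, "Spade", "Shovel")
print(count, [link.get("target") for link in root.iter("link")])
```

Because the transformation works on a tree rather than on raw markup, it touches only link targets and leaves the visible link text and the surrounding prose alone, which is hard to guarantee with textual search and replace.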

On the Technology Behind the Wikipedia Sexism Debate on “American Women Novelists”

The English Wikipedia is currently embroiled in a debate on sexism (local copy), because of classifying female American novelists as “American Women Novelists” while leaving male American novelists in the more general category “American Novelists”, suggesting a subordinate role of female novelists. I find this debate regrettable for the apparent sexism but also interesting for the technology underlying such changes, which I would like to focus on here.

By technology, I mean bureaucratic practices, conceptual modeling of the world and Wikipedia content, and software tools to support changes to those models.

Design and Implementation of the Sweble Wikitext Parser: Unlocking the Structured Data of Wikipedia

Abstract: The heart of each wiki, including Wikipedia, is its content. Most machine processing starts and ends with this content. At present, such processing is limited, because most wiki engines today cannot provide a complete and precise representation of the wiki’s content. They can only generate HTML. The main reason is the lack of well-defined parsers that can handle the complexity of modern wiki markup. This applies to MediaWiki, the software running Wikipedia, and most other wiki engines. This paper shows why it has been so difficult to develop comprehensive parsers for wiki markup. It presents the design and implementation of a parser for Wikitext, the wiki markup language of MediaWiki. We use parsing expression grammars where most parsers used no grammars or grammars poorly suited to the task. Using this parser it is possible to directly and precisely query the structured data within wikis, including Wikipedia. The parser is available as open source from http://sweble.org.

Keywords: Wiki, Wikipedia, Wiki Parser, Wikitext Parser, Parsing Expression Grammar, PEG, Abstract Syntax Tree, AST, WYSIWYG, Sweble.

Reference: Hannes Dohrn and Dirk Riehle. “Design and Implementation of the Sweble Wikitext Parser: Unlocking the Structured Data of Wikipedia.” In Proceedings of the 7th International Symposium on Wikis and Open Collaboration (WikiSym 2011). ACM Press, 2011. Pages 72-81.

The paper is available as a PDF file (preprint).
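
To give a feel for what parsing markup into an abstract syntax tree means, here is a small hand-written Python sketch in the spirit of a parsing expression grammar: rules are tried in a fixed order and the first one that matches wins. It is not the Sweble grammar or its API. The function names and the tuple-based AST are assumptions made for this example, and it covers only a toy subset of Wikitext (`[[target|label]]` links, `'''bold'''`, and plain text).

```python
def parse_inline(text):
    """Parse a toy Wikitext subset into a list of AST nodes."""
    nodes, pos = [], 0
    while pos < len(text):
        # Ordered choice: try each rule in turn; the first match wins.
        for rule in (parse_link, parse_bold, parse_char):
            result = rule(text, pos)
            if result is None:
                continue
            node, pos = result
            if node[0] == "text" and nodes and nodes[-1][0] == "text":
                # Merge adjacent characters into a single text node.
                nodes[-1] = ("text", nodes[-1][1] + node[1])
            else:
                nodes.append(node)
            break
    return nodes


def parse_link(text, pos):
    """Link <- '[[' target ('|' label)? ']]'"""
    if not text.startswith("[[", pos):
        return None
    end = text.find("]]", pos + 2)
    if end == -1:
        return None  # rule fails; the input is treated as plain text
    target, _, label = text[pos + 2:end].partition("|")
    return ("link", target, label or target), end + 2


def parse_bold(text, pos):
    """Bold <- "'''" Inline "'''" """
    if not text.startswith("'''", pos):
        return None
    end = text.find("'''", pos + 3)
    if end == -1:
        return None
    return ("bold", parse_inline(text[pos + 3:end])), end + 3


def parse_char(text, pos):
    """Fallback rule: consume one character as text; always succeeds."""
    return ("text", text[pos]), pos + 1


print(parse_inline("See '''[[Wiki|wikis]]''' now"))
```

Note how an unclosed `[[` does not raise an error but falls through to the text rule. Real Wikitext has many more such fallback cases, which is part of why the abstract calls comprehensive parsers difficult to build.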

Technical Report on WOM: An Object Model for Wikitext

Abstract: Wikipedia is a rich encyclopedia that is not only of great use to its contributors and readers but also to researchers and providers of third party software around Wikipedia. However, Wikipedia’s content is only available as Wikitext, the markup language in which articles on Wikipedia are written, and whoever needs to access the content of an article has to implement their own parser or has to use one of the available parser solutions. Unfortunately, those parsers which convert Wikitext into a high-level representation like an abstract syntax tree (AST) define their own format for storing and providing access to this data structure. Further, the semantics of Wikitext are only defined implicitly in the MediaWiki software itself. This situation makes it difficult to reason about the semantic content of an article or exchange and modify articles in a standardized and machine-accessible way. To remedy this situation we propose a markup language, called XWML, in which articles can be stored and an object model, called WOM, that defines how the contents of an article can be read and modified.

Keywords: Wiki, Wikipedia, Wikitext, Wikitext Parser, Open Source, Sweble, MediaWiki, MediaWiki Parser, XWML, HTML, WOM

Reference: Hannes Dohrn and Dirk Riehle. WOM: An Object Model for Wikitext. University of Erlangen, Technical Report CS-2011-05 (July 2011).

The technical report is available as a PDF file.

The Parser That Cracked The MediaWiki Code

I am happy to announce that we have finally open sourced the Sweble Wikitext parser. You can find the announcement on my research group’s blog or directly on the Sweble project site. This is the work of Hannes Dohrn, my first Ph.D. student, whom I hired in 2009 to implement a Wikitext parser.

So what about this “cracking the MediaWiki code”?

Wikipedia aims to bring the (encyclopedic) knowledge of the world to all of us, for free. While already ten years old, the Wikipedia community is just getting started. We have barely seen the tip of the iceberg; there is so much more to come. All that wonderful content is being written by volunteers using a (seemingly) simple language called Wikitext (the stuff you type in once you click on edit). Until today, Wikitext had been poorly defined.

Learning from Wikipedia: Open Collaboration within Corporations

Wikipedia is the free online encyclopedia that has taken the Internet by storm. It is written and administered solely by volunteers. How exactly did this come about and how does it work? Can it keep working? And maybe more importantly, can you transfer its practices to the workplace to achieve similar levels of dedication and quality of work? In this presentation I describe the structure, processes and governance of Wikipedia and discuss how some of its practices can be transferred to the corporate context.

This presentation represents the next step in the evolution of two Wikimania tutorials/workshops, see Presentations/Tutorials. If the slideshow doesn’t play, please use the PDF file download below.

Reference: Dirk Riehle. “Learning from Wikipedia: Open Collaboration within Corporations.” Invited talk at Talk the Future 2008. Krems, Austria: 2008.

The slides are available as a PDF file.

A Grammar for Standardized Wiki Markup

Authors: Martin Junghans, Dirk Riehle, Rama Gurram, Matthias Kaiser, Mário Lopes, Umit Yalcinalp

Abstract: Today’s wiki engines are not interoperable. The rendering engine is tied to the processing tools which are tied to the wiki editors. This is an unfortunate consequence of the lack of rigorously specified standards. This paper discusses an EBNF-based grammar for Wiki Creole 1.0, a community standard for wiki markup, and demonstrates its benefits. Wiki Creole is being specified using prose, so our grammar revealed several categories of ambiguities, showing the value of a more formal approach to wiki markup specification. The formalization of Wiki Creole using a grammar shows performance problems that today’s regular-expression-based wiki parsers might face when scaling up. We present an implementation of a wiki markup parser and demonstrate our test cases for validating Wiki Creole parsers. We view the work presented in this paper as an important step towards decoupling wiki rendering engines from processing tools and from editing tools by means of a precise and complete wiki markup specification. This decoupling layer will then allow innovation on these different parts to proceed independently and as is expected at a faster pace than before.

Reference: In Proceedings of the 2008 International Symposium on Wikis (WikiSym ’08). ACM Press, 2008: Article No. 21.

Available as a PDF file.