Technical Report on WOM: An Object Model for Wikitext

Abstract: Wikipedia is a rich encyclopedia that is of great use not only to its contributors and readers but also to researchers and providers of third-party software around Wikipedia. However, Wikipedia’s content is only available as Wikitext, the markup language in which articles on Wikipedia are written, and whoever needs to access the content of an article has to implement their own parser or use one of the available parser solutions. Unfortunately, those parsers that convert Wikitext into a high-level representation like an abstract syntax tree (AST) define their own format for storing and providing access to this data structure. Further, the semantics of Wikitext are only defined implicitly in the MediaWiki software itself. This situation makes it difficult to reason about the semantic content of an article or to exchange and modify articles in a standardized and machine-accessible way. To remedy this situation, we propose a markup language, called XWML, in which articles can be stored, and an object model, called WOM, that defines how the contents of an article can be read and modified.

Keywords: Wiki, Wikipedia, Wikitext, Wikitext Parser, Open Source, Sweble, Mediawiki, Mediawiki Parser, XWML, HTML, WOM

Reference: Hannes Dohrn and Dirk Riehle. WOM: An Object Model for Wikitext. University of Erlangen, Technical Report CS-2011-05 (July 2011).

The technical report is available as a PDF file.

Controlling and Steering Open Source Projects

The IEEE just published a short version of the “control points and steering mechanisms” article. Here is the abstract. Please see the original for more details.

Abstract: Open source software has become an important part of the software business. In a 2009 survey, Forrester Research found that 46 percent of all responding enterprises were using or implementing open source software. Moreover, in 2009, the Gartner Group estimated that by 2012, at least 80 percent of all software product firms will use open source software. Thus, it’s important to understand how software product firms depend on open source and how they manage that dependency to meet their business goals. There are three main types of software product firms. [...]


The Evolution of the SourceForge Home Page

I was revising my talk on “Inner Source” when it occurred to me that it might be fun to review the changes to the sf.net (SourceForge) homepage. Please find my collection of screenshots below. I only started saving them in 2007, so pointers to more and older screenshots are welcome! (In particular if they come with a CC license so that I can use them in talks; attribution is a given. I trust that Geek.net does not object…) Thanks!


Call for Papers: SoSyM Special Issue on Enterprise Modeling

Call for Papers as PDF

Modern organizations rely on complex configurations of distributed IT systems that implement key business processes and provide databases, data warehousing, and business intelligence. The current business environment requires organizations to comply with a range of externally defined regulations such as Sarbanes-Oxley and Basel II.

Organizations need to be increasingly agile and robust, and they must be able to react to complex events, possibly through dynamic reconfiguration.


The Open Source Big Bang

Open source is not only software, but also an approach to software development. The public nature of open source projects lets us show how open source software development scales to the largest project sizes. The following figure illustrates the scalability of open source software development. I call it the big bang of open source.


Plagiarism on the Rise?

I recently reviewed a paper where, a few paragraphs into the introduction, the words seemed strangely familiar. After some cross-checking, I realised that the author of the paper had copied about two paragraphs verbatim from one of my papers. After a bit more digging, I found other places in the paper where the author had copied from other researchers’ work as well. In all cases, no quotation marks had been used, nor had any reference been provided. The papers the author had copied from were listed in the reference section, though.


The Parser that Cracked the MediaWiki Code

I am happy to announce that we finally open-sourced the Sweble Wikitext parser. You can find the announcement on the OSR Group blog or directly on the Sweble project site. This is the work of Hannes Dohrn, my first Ph.D. student, whom I hired in 2009 to implement a Wikitext parser.

So what about this “cracking the MediaWiki code”?

Wikipedia aims to bring the (encyclopedic) knowledge of the world to all of us, for free. While already ten years old, the Wikipedia community is just getting started; we have barely seen the tip of the iceberg, and there is so much more to come. All that wonderful content is being written by volunteers using a (seemingly) simple language called Wikitext (the stuff you type in once you click on edit). Until today, Wikitext had been poorly defined.


Rigor vs. Relevance, or: What is the Size of a Dissertation?

While listening to a colleague’s talk the other day, I got an idea for a Ph.D. thesis (grant proposal). I wrote up a short summary and sent it to him. He thought it was fine but commented that it might be a bit “thin”. This made me wonder: How do we determine the sufficient size of a dissertation, to stay with the metaphor of thin, so that we can conclude that some research work is worth a Ph.D. title? Most university regulations require “significant” (read: non-trivial) scientific progress and then leave it to the advisor and the reading committee to determine whether a submitted dissertation fits the bill.


More Upcoming Talks: Open Source Research

I’ll be presenting the Open Source Research talk repeatedly over the next few months. The next three instances are in China, specifically:

  • Tsinghua University on March 17th, 2011
  • Peking University on March 18th, 2011
  • University of Macau on April 1st, 2011

After that it’s back to Germany.