Technical Report on WOM: An Object Model for Wikitext

Abstract: Wikipedia is a rich encyclopedia that is not only of great use to its contributors and readers but also to researchers and providers of third party software around Wikipedia. However, Wikipedia’s content is only available as Wikitext, the markup language in which articles on Wikipedia are written, and whoever needs to access the content of an article has to implement their own parser or has to use one of the available parser solutions. Unfortunately, those parsers which convert Wikitext into a high-level representation like an abstract syntax tree (AST) define their own format for storing and providing access to this data structure. Further, the semantics of Wikitext are only defined implicitly in the MediaWiki software itself. This situation makes it difficult to reason about the semantic content of an article or exchange and modify articles in a standardized and machine-accessible way. To remedy this situation we propose a markup language, called XWML, in which articles can be stored and an object model, called WOM, that defines how the contents of an article can be read and modified.

Keywords: Wiki, Wikipedia, Wikitext, Wikitext Parser, Open Source, Sweble, Mediawiki, Mediawiki Parser, XWML, HTML, WOM

Reference: Hannes Dohrn and Dirk Riehle. WOM: An Object Model for Wikitext. University of Erlangen, Technical Report CS-2011-05 (July 2011).

The technical report is available as a PDF file.

On the Open Cloud Principles: Every Real-World Specification is an Underspecification

Trying to wrap my head around the Open Cloud Principles put out by the revamped Open Cloud Initiative, I’m happy to note that software engineering research has something to say about the challenges these principles will face.

Every real-world specification is an underspecification.

So, well, I say that, but I doubt that I’m the first one to have learned this from 30+ years of software engineering research. This principle leads us directly to the challenges facing anyone who tries to stay true to the intentions behind the Open Cloud Principles.


Controlling and Steering Open Source Projects

The IEEE just published a short version of the “control points and steering mechanisms” article. Here is the abstract. Please see the original for more details.

Abstract: Open source software has become an important part of the software business. In a 2009 survey, Forrester Research found that 46 percent of all responding enterprises were using or implementing open source software. Moreover, in 2009, the Gartner Group estimated that by 2012, at least 80 percent of all software product firms will use open source software. Thus, it’s important to understand how software product firms depend on open source and how they manage that dependency to meet their business goals. There are three main types of software product firms. [...]


The Evolution of the Source Forge Home Page

I was revising my talk on “Inner Source” when it occurred to me that it might be fun to review the changes to the sf.net (Sourceforge) homepage. Please find my collection of screenshots below. I only started saving them in 2007, so pointers to more and older screenshots are welcome! (In particular if they come with a CC license so that I can use them in talks; attribution is a given. I trust that Geek.net does not object…) Thanks!


Upcoming Talk: The Open Source Volunteering Process

Title: The Open Source Volunteering Process
Abstract: Open source projects critically depend on bringing new project members on board speedily and effectively. In this talk, I’ll describe the open source volunteering and on-boarding process. I’ll discuss the roles people play and the practices they follow, and I’ll illustrate how this process works by showing the open source software development tools that support it.
Speaker: Prof. Dr. Dirk Riehle, University of Erlangen-Nürnberg
Date: July 14, 2011, 18:30-20:00
Location: Cogneon GmbH, Henkestr. 91, Erlangen
More: GfWM website announcement (in German)

The Java IP Story

Every year, I teach the AMOS class, a lab course on “Agile Methods and Open Source” that combines lectures with a real software project that ideally turns into a startup (see the AMOS Project concept, in German). To explain open source, I have to introduce students to intellectual property rights, of which most have been blissfully unaware until then. Nothing teaches concepts better than a colorful story, and so I have been using the IP strategies around Java to make this dry topic come alive. For fun, comments, and corrections, I’m providing the short version of my talk below, including commentary. (You can also download a PDF version of the talk, licensed as CC-BY 3.0. If you find this useful for teaching, please tell me.) Students at this point have a basic working understanding of intellectual property and exclusion rights. Please let me know what you think! Finally, IANAL.

Java is an important technology powering the modern web and in particular enterprise applications. It has a checkered intellectual property history, and with the recent acquisition of Sun, the Java creator and owner, by Oracle, things only stand to heat up. This slide set discusses some of the more interesting issues around Java intellectual property and its strategic use in business.

  1. What is Java?
  2. Short Java IP Story Time-Line
  3. Three Substories
  4. Java’s Challenge to the Windows Platform
  5. Microsoft and Java
  6. The OpenJDK Strategy (Open Core Model)
  7. Certification of Compatible Implementations
  8. Threats to Commercial Revenue
  9. Main Tools to Curtail “Competitors”
  10. Problems for Alternative Implementations
  11. Problems for OpenJDK Forks
  12. Thank you! and References


The Open Source Big Bang

Open source is not only software, but also an approach to software development. The public nature of open source projects lets us show how open source software development scales to the largest project sizes. The following figure illustrates the scalability of open source software development. I call it the big bang of open source.


The Open Source Innovation and Commoditization Frontier

Following up on Matt Aslett’s excellent post about the growth of permissive licenses and a short discussion about it on my research group’s blog, I want to suggest a thought about the ratio of new vendor-owned vs. community-owned open source projects. I’m ignoring existing projects because of their path dependence (read: only today do we know what we are doing). My point is illustrated by the following figure that I occasionally use:


New Talk: How and Why IT User Companies Sponsor Open Source

New talk! For German, see below. Other stock talks here. If you are interested in this talk, feel free to contact me.

Topics: Open source, IT user company, open source foundation, sponsored open source
Audience: CIO, CFO, product manager, project manager
Format: 45min talk, 60min talk
Level: Intermediate


The Parser that Cracked the MediaWiki Code

I am happy to announce that we finally open sourced the Sweble Wikitext parser. You can find the announcement on the OSR Group blog or directly on the Sweble project site. This is the work of Hannes Dohrn, my first Ph.D. student, whom I hired in 2009 to implement a Wikitext parser.

So what about this “cracking the MediaWiki code”?

Wikipedia aims to bring the (encyclopedic) knowledge of the world to all of us, for free. While already ten years old, the Wikipedia community is just getting started, and we have barely seen the tip of the iceberg; there is so much more to come. All that wonderful content is being written by volunteers using a (seemingly) simple language called Wikitext (the stuff you type in once you click on edit). Until today, Wikitext had been poorly defined.
