On the technology behind the Wikipedia sexism debate on “American women novelists”

The English Wikipedia is currently embroiled in a debate on sexism (local copy) because female American novelists were moved into the category “American Women Novelists” while male American novelists remained in the more general category “American Novelists”, which suggests a subordinate role for female novelists. I find this debate regrettable for the apparent sexism but also interesting for the technology underlying such changes, which I would like to focus on here.

By technology, I mean bureaucratic practices, the conceptual modeling of the world and of Wikipedia content, and the software tools that support changes to those models.

First, who made the decision to move the articles on female novelists to an “American Women Novelists” category? Wikipedia bureaucracy is structured into “improvement projects”, groups of volunteers who care about a particular topic and work together to improve its presentation on Wikipedia. These projects are part of the large bureaucratic underbelly of Wikipedia that few people other than editors care to look at. So in most cases, there would be an American Novelist improvement group. Any complaints should be directed to that group. (And not to the Wikimedia Foundation, the non-profit which operates Wikipedia. The various Wikipedia communities have been very clear that they don’t want the Wikimedia Foundation to meddle with content. As a consequence, the Wikimedia Foundation only interferes in cases of grave violations of conduct or if laws are being violated.) In this particular case, I have not been able to figure out who is actually behind these changes. Anyway, let’s assume there is such an improvement project.

This group decided that “American Women Novelists” is a good new category. The original motivation was to keep categories small, which may or may not have been a reasonable decision. But the choice of the category term shows all the things that are wrong with the category system, leading to wrong or misleading labeling, and [sarcasm on] may well lead to war between nations in the future [sarcasm off] as Wikipedia may become the oracle of all things knowledge.

The first thing that is wrong is that next to “American Women Novelists” there obviously should be “American Men Novelists”. I’m sure there should also be an “American Other Novelists” category to not exclude the LGBT community. That third term is obviously clumsy and makes clear that a better choice of name would have been “female/male/other American Novelists”.

Considering this, it becomes clear that the category name mixes several things: “novelist” as a role people play, “female/male/other” as properties of people, and the implicit classification that the article is about a person (as derived from the role “novelist”, which applies to humans only and not to mammals or cities or t-shirts). If the current approach were a proper approach to classification, a growing body of knowledge on Wikipedia would soon have us introducing categories like “tall slender blonde other American Novelist”. (I trust the common sense of the respective improvement project not to follow this example.)
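To make the growth in category names concrete, here is a minimal sketch in Python. The attribute dimensions and their values are made up for illustration and are not taken from Wikipedia. It compares the number of flat labels needed when every combination gets its own category name with the number of values needed when each dimension is kept separate.

```python
from itertools import product

# Hypothetical attribute dimensions, chosen only for illustration.
dimensions = {
    "gender": ["female", "male", "other"],
    "nationality": ["American", "British", "German"],
    "role": ["novelist", "publisher", "nurse"],
}


def combined_labels(dims):
    """Build one flat label per combination of values, e.g. 'female American novelist'."""
    return [" ".join(combo) for combo in product(*dims.values())]


labels = combined_labels(dimensions)

# One flat category per combination: the count is the product of the dimension sizes.
flat_count = len(labels)

# Separate dimensions: the count is only the sum of the dimension sizes.
separate_count = sum(len(values) for values in dimensions.values())

print(flat_count, separate_count)
```

Every additional dimension multiplies the number of flat labels but only adds to the number of separately maintained values.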

A more thorough approach to categorizing the content of articles on Wikipedia would rely on a model of content categories. For articles about people, the central concept would likely be person, an identifiable object, with roles that a person performs in various contexts, for example, mother or father, novelist or publisher or bartender or race car driver or nurse. Properties like male or female, date of birth or age would be broken out and attached to the appropriate concept, whether a central or auxiliary concept. In fact, the Wikidata project tries to do just that.
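As a rough illustration of such a model, the following Python sketch separates the person, the roles a person performs, and the person's properties. All class, field, and function names are my own assumptions for this example; this is neither Wikidata's data model nor anything Wikipedia implements. The people are placeholders, not real authors.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass(frozen=True)
class Role:
    """A role a person performs in some context (hypothetical model)."""
    name: str
    context: str = ""


@dataclass
class Person:
    """The central concept: properties and roles are attached to it, not baked into a label."""
    name: str
    gender: Optional[str] = None
    nationality: Optional[str] = None
    birth_date: Optional[date] = None
    roles: Set[Role] = field(default_factory=set)

    def has_role(self, role_name: str) -> bool:
        return any(role.name == role_name for role in self.roles)


def select(people, role_name, **properties):
    """Return people with the given role whose properties match all given values."""
    return [
        person for person in people
        if person.has_role(role_name)
        and all(getattr(person, key) == value for key, value in properties.items())
    ]


# Placeholder data, invented for the example.
people = [
    Person("Author A", gender="female", nationality="American",
           roles={Role("novelist")}),
    Person("Author B", gender="male", nationality="American",
           roles={Role("novelist"), Role("publisher")}),
    Person("Person C", gender="female", nationality="American",
           roles={Role("nurse")}),
]

all_novelists = select(people, "novelist", nationality="American")
women_novelists = select(people, "novelist", nationality="American", gender="female")
print([p.name for p in all_novelists], [p.name for p in women_novelists])
```

With a model like this, “American novelists” and “American women novelists” are two queries over the same data rather than two competing labels, so nobody has to be moved out of one to appear in the other.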

Computer science is a discipline that has long worked on such challenges. There are two main approaches, a more rigid and precise one (conceptual modeling or object-orientation), and a fast-and-loose one based on graph theory (ontologies). The second one is the one being favored by Wikipedia. I’m in the first camp and believe that a model should have defined semantics, but I recognize the value that the second approach provides, namely that you don’t have to think too deeply while making changes. Wikipedia is always “good enough” and where it isn’t, you fix it up after the fact rather than getting it right immediately.

Which brings me to tool support, finally. In Wikipedia, categories are just textual labels. There could be anything in there. A single typo and you fall out of the category. Today, there is no single place where you could change a category name. Rather, you need to send out a group of volunteers to look at every page and change the label by hand. This can easily take days if not weeks. This is why, in the sexism debate, the original reporter described a gradual change rather than a sudden, instantaneous one. The way the Wikipedia technology has been built, such changes can’t be implemented instantaneously.
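The following Python sketch illustrates the point with a toy model: pages are plain strings of wikitext, and category membership is whatever `[[Category:...]]` labels happen to appear in the text. The page titles, the in-memory dictionary, and the helper functions are assumptions made for this example. This is not how MediaWiki or Sweble is implemented.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|Sortkey]] labels in wikitext.
CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")


def categories_of(wikitext):
    """Collect the category names that appear as textual labels on a page."""
    return {name.strip() for name in CATEGORY_LINK.findall(wikitext)}


def members(pages, category):
    """A category's members are simply the pages whose text carries the exact label."""
    return sorted(title for title, text in pages.items()
                  if category in categories_of(text))


def rename_category(pages, old, new):
    """Rename a category by rewriting the label on every page; returns pages edited."""
    pattern = re.compile(
        r"\[\[Category:\s*" + re.escape(old) + r"\s*(\|[^\]]*)?\]\]")
    edited = 0
    for title, text in list(pages.items()):
        new_text, count = pattern.subn(
            lambda m: "[[Category:" + new + (m.group(1) or "") + "]]", text)
        if count:
            pages[title] = new_text
            edited += 1
    return edited


# Placeholder pages, invented for the example. Page C contains a typo in the label.
pages = {
    "Page A": "Some text.\n[[Category:American novelists]]",
    "Page B": "Some text.\n[[Category:American novelists|B]]",
    "Page C": "Some text.\n[[Category:American novelits]]",
}

before = members(pages, "American novelists")
edited = rename_category(pages, "American novelists", "Novelists from the United States")
after = members(pages, "Novelists from the United States")
print(before, edited, after)
```

Page C never shows up in the category because of its typo, and the rename has to touch every member page one by one. In this toy model that is a loop; on a live wiki each of those touches is a separate page edit.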

Over at the Sweble project, my research group has been developing technology that lets editors make wide-ranging changes like renaming a category at the click of a button.

This includes instantaneously reverting mistakes.

PS: My Ph.D. student wants me to note that his work has been tested in the laboratory only and not on the real Wikipedia. We use a different underlying technology (Java rather than PHP), so adapting our work to Wikipedia would take substantial effort.

Posted on 2013-04-28