Dirk Riehle's Industry and Research Publications

On the technology behind the Wikipedia sexism debate on “American women novelists”

The English Wikipedia is currently embroiled in a debate on sexism because editors have been classifying female American novelists as “American Women Novelists” while leaving male American novelists in the more general category “American Novelists”, suggesting a subordinate role for female novelists. I find this debate regrettable for the apparent sexism but also interesting for the technology underlying such changes, which I would like to focus on here.

By technology, I mean bureaucratic practices, conceptual modeling of the world and of Wikipedia content, and software tools to support changes to those models.

First, who made the decision to move the articles on female novelists to an “American Women Novelists” category? Wikipedia bureaucracy is structured into “improvement projects”, groups of volunteers who care about a particular topic, and work together to improve this topic’s presentation on Wikipedia. It is part of the large bureaucratic underbelly of Wikipedia that few non-editors care to look at. So in most cases, there would be an American Novelist improvement group. Any complaints should be directed to that group. (And not to the Wikimedia Foundation, the non-profit which operates Wikipedia. The various Wikipedia communities have been very clear that they don’t want the Wikimedia Foundation to meddle with content. As a consequence, the Wikimedia Foundation only interferes in cases of grave violations of conduct or if laws are being violated.) In this particular case, I have not been able to figure out who actually is behind these changes. Anyway, let’s assume there is such an improvement project.

This group decided that “American Women Novelists” is a good new category. The original motivation was to keep categories small, which may or may not have been a reasonable decision. But the choice of the category term shows all the things that are wrong with the category system, leading to wrong or misleading labeling, and [sarcasm on] may well lead to war between nations in the future [sarcasm off] as Wikipedia may become the oracle of all things knowledge.

The first thing that is wrong is that next to “American Women Novelists” there obviously should be “American Men Novelists”. I’m sure there should also be an “American Other Novelists” category to not exclude the LGBT community. That third term is obviously clumsy and makes clear that a better choice of name would have been “female/male/other American Novelists”.

Considering this, it becomes clear that the category name mixes various things: “Novelist” as a role people play, “female/male/other” as properties of people, and the implicit classification that the article is about a person (as derived from the role “novelist”, which applies to humans only and not to mammals in general or cities or t-shirts). If the current approach were a proper approach to classification, a growing body of knowledge on Wikipedia would soon have us introducing categories like “tall slender blonde other American Novelist”. (I trust the common sense of the respective improvement project not to follow this example.)

A more thorough approach to categorizing the content of articles on Wikipedia would rely on a model of content categories. For articles about people, the central concept would likely be person, an identifiable object, with roles that a person performs in various contexts, for example, mother or father, novelist or publisher or bartender or race car driver or nurse. Properties like male or female, date of birth or age would be broken out and attached to the appropriate concept, whether a central or auxiliary concept. In fact, the Wikidata project tries to do just that.
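To make the idea of person, role, and property a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not how Wikidata or Wikipedia actually model this: the names Person, Role, Gender, and select are invented for the example. The point is that “American women novelists” stops being a hand-maintained label and becomes a query over separately stored facts.

```python
from dataclasses import dataclass, field
from enum import Enum


class Gender(Enum):
    """A property of a person (hypothetical, deliberately small value set)."""
    FEMALE = "female"
    MALE = "male"
    OTHER = "other"


@dataclass(frozen=True)
class Role:
    """A role a person performs in some context, for example novelist."""
    name: str


@dataclass
class Person:
    """The central concept: an identifiable person with properties and roles."""
    name: str
    nationality: str
    gender: Gender
    roles: set[Role] = field(default_factory=set)


NOVELIST = Role("novelist")


def select(people, *, role=None, nationality=None, gender=None):
    """Return the names of all people matching every given criterion.

    A criterion left as None is ignored, so broader and narrower
    "categories" come from the same data.
    """
    return [
        p.name
        for p in people
        if (role is None or role in p.roles)
        and (nationality is None or p.nationality == nationality)
        and (gender is None or p.gender == gender)
    ]


people = [
    Person("Harper Lee", "American", Gender.FEMALE, {NOVELIST}),
    Person("Ernest Hemingway", "American", Gender.MALE, {NOVELIST}),
    Person("Jane Austen", "English", Gender.FEMALE, {NOVELIST}),
]

# The general category still contains everybody; the narrower one is a view.
american_novelists = select(people, role=NOVELIST, nationality="American")
american_women_novelists = select(
    people, role=NOVELIST, nationality="American", gender=Gender.FEMALE
)
```

In this sketch nobody is ever moved out of the general category: the narrower listing is derived from it, which is the property the label-based approach lacks.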

Computer science is a discipline that has long worked on such challenges. There are two main approaches, a more rigid and precise one (conceptual modeling or object-orientation), and a fast-and-loose one based on graph theory (ontologies). The second one is the one being favored by Wikipedia. I’m in the first camp and believe that a model should have defined semantics, but I recognize the value that the second approach provides, namely that you don’t have to think too deeply while making changes. Wikipedia is always “good enough” and where it isn’t, you fix it up after the fact rather than getting it right immediately.

Which brings me to tool support, finally. In Wikipedia, categories are just textual labels on pages. There could be anything in there. A single typo and the page falls out of the category. Today, there is no single place where you could change a category name. Rather, you need to send out a group of volunteers to look at every page and change the label by hand. This can easily take days if not weeks. This is why, in the sexism debate, the original reporter complained about a gradual change rather than about being suddenly confronted with an instantaneous one. The way the Wikipedia technology has been built, such changes can’t be implemented instantaneously.
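The following Python sketch illustrates what “just textual labels” means in practice. It assumes the usual wikitext form of a category link, [[Category:Name]] with an optional sort key after a pipe; the page titles, page texts, and the helper names categories_of and rename_category are made up for the example, and the regular expression is a simplification that ignores templates and other ways a page can end up in a category.

```python
import re

# Simplified pattern for a category link: [[Category:Name]] or [[Category:Name|sort key]]
CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)(\|[^\]]*)?\]\]")


def categories_of(wikitext):
    """Return the category names a page's text claims membership in."""
    return [m.group(1).strip() for m in CATEGORY_LINK.finditer(wikitext)]


def rename_category(pages, old, new):
    """Rewrite the label on every page, one page at a time.

    Returns the titles of the pages that had to be edited.
    """
    def swap(match):
        if match.group(1).strip() == old:
            return f"[[Category:{new}{match.group(2) or ''}]]"
        return match.group(0)

    changed = []
    for title, text in list(pages.items()):
        new_text = CATEGORY_LINK.sub(swap, text)
        if new_text != text:
            pages[title] = new_text
            changed.append(title)
    return changed


# Hypothetical pages; Page C carries a typo in its label.
pages = {
    "Page A": "Text about A.\n[[Category:American novelists]]",
    "Page B": "Text about B.\n[[Category:American novelists|B]]",
    "Page C": "Text about C.\n[[Category:American novelits]]",
}

# Membership is whatever the labels happen to say, so Page C is silently missing.
members = [t for t, text in pages.items() if "American novelists" in categories_of(text)]

# A rename is one edit per member page, and it never finds the page with the typo.
changed = rename_category(pages, "American novelists", "Novelists from the United States")
```

Even in this toy version, the rename costs one edit per page, and the page with the misspelled label is left behind without anyone noticing.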

Over at the Sweble project, my research group has been developing technology that lets editors make wide-ranging changes like renaming a category at the click of a button.

This includes instantaneously reverting mistakes.
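To show why a rename and its reversal can be a single operation at all, here is a small Python sketch. To be clear, this is not Sweble’s implementation, which I am not describing here; it is a hypothetical model in which pages refer to a category by an identifier and the name is stored exactly once. Category, Page, registry, and labels are names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A category as a first-class object: the name lives in one place only."""
    id: int
    name: str
    history: list[str] = field(default_factory=list)

    def rename(self, new_name):
        """Change the name once, remembering the old one."""
        self.history.append(self.name)
        self.name = new_name

    def revert(self):
        """Undo the most recent rename, if there was one."""
        if self.history:
            self.name = self.history.pop()


@dataclass
class Page:
    """A page refers to categories by identifier, not by a textual label."""
    title: str
    category_ids: set[int] = field(default_factory=set)


def labels(page, registry):
    """Resolve a page's category identifiers to their current names."""
    return sorted(registry[cid].name for cid in page.category_ids)


registry = {1: Category(1, "American novelists")}
pages = [Page("Page A", {1}), Page("Page B", {1})]

registry[1].rename("Novelists from the United States")
after_rename = [labels(p, registry) for p in pages]

registry[1].revert()
after_revert = [labels(p, registry) for p in pages]
```

No page is touched by either operation, so there is no period in which some pages show the old name and others the new one.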

PS: My Ph.D. student wants me to note that his work has been tested in the laboratory only and not on the real Wikipedia. We use a different underlying technology (Java rather than PHP), so adapting our work to Wikipedia would take substantial effort.


  1. obiwan

    interested in your thoughts on this approach- category intersections:

  2. John Broughton

    This is *not* true: “Wikipedia bureaucracy is structured into ‘improvement projects’, groups of volunteers who care about a particular topic, and work together to improve this topic’s presentation on Wikipedia. … So in most cases, there would be an American Novelist improvement group. Any complaints should be directed to that group.”
    By “improvement projects” you must mean WikiProjects. However, (a) WikiProjects don’t “own” or otherwise control the articles associated with them; (b) most editors don’t belong to *any* WikiProject, and those who do belong do not in any way restrict their editing to associated articles; (c) most WikiProjects are inactive (a polite way of saying “defunct”); (d) virtually all active WikiProjects have far too many associated articles to be able to monitor all of them, even if members were so inclined (they are not); rather, they tend to work on a very small subset at any given time.
    And no, there is no American Novelist improvement group. There is a WikiProject Novels group, but it is (a) focused on books, not people, and (b) not limited by geography.
    And this is flatly wrong: “This group decided that ‘American Women Novelists’ is a good new category.” In fact, *no* group made that decision; it was made by a single person. And yes, Wikipedia is so decentralized that a single person can not only create a new category, but can then use semi-automated tools to do fairly high-speed edits to move articles from older categories to the new one.

    1. Dirk Riehle

      @john I don’t see how you are contradicting what I said; I read you repeating it.
      More specifically, I didn’t say these improvement projects are all there is. Of course not, many editors are not associated with any projects, some are associated with several.
      When you look at online communities, you see a series of roles that people play and take on over time: from pure content focus to structural focus to admin focus. Which is to say, people move on from caring about particular topics to caring about the whole thing. Members of wiki projects are somewhere in between: beyond single articles, into improving whole categories, not yet running back office processes.
      As to who did this: I couldn’t figure it out clearly, nor could Jimmy Wales or folks from the Wikimedia Foundation, as various emails showed. So I only said “let’s assume” it is “the usual culprit”, which is good enough for my focus on technology that needs improvement. Refactoring categories is a typical improvement project activity.

  3. Nemo

    On the Italian Wikipedia, these categories are defined by https://it.wikipedia.org/wiki/Template:Bio
    Crowded categories like https://it.wikipedia.org/wiki/Categoria:Scrittori_italiani are disambiguated by century; all distinctions are applied via the central template for our 220+ thousand articles.
    This is a tool, but the point is that the rules for biographies are defined centrally, by the biographies WikiProject; no such tool would exist if such a coordination hadn’t existed. On en.wiki they’re more chaotic, so categories are more inconsistent and a consensus has to be found after the fact to reorganise them.

    1. bawolff

      As an aside, the whole “categories can’t be easily renamed” thing will hopefully be fixed somewhat soon in MediaWiki (I would really like to see it fixed by the end of this summer)

  4. Adrian Kuhn

    Interesting post. I find the motivation to “keep categories small” absolutely mind blowing. So apparently, Wikipedia’s bureaucratic underbelly has put an upper limit on the number of American novelists, and when there are too many, some of them have to be segregated into another category? Sounds to me like a UX problem with categories becoming hard to navigate when they are large, so maybe that should be fixed rather than resorting to apartheid …

    1. Dirk Riehle

      In my mind, this is all object-oriented modeling of the world. Having subclasses (categories) of Novelists makes sense. Following the abstract superclass rule you’d even want the superclass (more general category) to be empty. But you are exactly correct, that use cases determine how stuff gets categorized rather than that clear conceptual modeling determines how stuff should get categorized after which different tools allow for different use cases.
