The Commit Size Distribution of Open Source Software (Summary)

We have finished our work on modeling the size of commits in open source projects, a line of research we call the commit size distribution of open source. This work is relevant for anyone who would like to know how much code developers write in a single commit (code contribution) to a project. For example, if you build software development tools, you may want to know this. The work pairs nicely with an upcoming publication on the commit frequency of open source, that is, the ETA (estimated time of arrival) of the next commit to a project.

There are three papers, in descending order of importance:

  1. A Model of the Commit Size Distribution of Open Source (2013)
  2. Developer Belief vs. Reality on Commit Size Distribution (2012)
  3. The Original Data behind the Commit Size Distribution (2009)

In classic academic terms, paper #3 from 2009 is an early paper that presents the original data, while paper #1 from 2013 provides a high-quality mathematical model of that data, a model that the 2009 paper does not include. Paper #2 shows that software developers are generally unaware of this data and that their intuitions, which guide them when designing software development tools, are off.

I have one gripe with this line of work, and it relates to the academic publication process. We submitted papers #1 and #2 to high-profile publication venues. Both were rejected on the grounds that they provided nothing new compared with paper #3. This still boggles my mind, given that the papers are clearly different. How can a reviewer not understand the difference between graphing raw data (paper #3) and building a high-quality mathematical model of that same data (paper #1)?

I’m happy we finished this line of work.
