What is the most common size of code contributions to open source? Maybe 30 lines of source code? 200 lines? Or just one line? What’s your guess?
In a recent paper on the commit size distribution in open source, we show that the most common size of a code contribution is one line of source code. In our sample of more than 8 million commits, one-line source code commits represent more than 12%, two-line commits represent 9%, and three-line commits represent 5.5% of all commits. The following figure shows this data.
In general, small commits dominate open source. The following figure shows commits of sizes 1-100 source code lines in our sample population. The 1-100 source code line commits make up more than 83% of all commits. As one can see, it is an almost strictly falling curve. In fact, the paper shows that this curve can be closely modeled by a power law. But for that you have to dig into the paper itself.
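To make the power-law idea concrete, here is a minimal sketch of how one might check it on one's own data. It is not the method used in the paper: the function names `size_shares` and `fit_power_law` are made up for illustration, the input is assumed to be a plain list of commit sizes in source code lines (each at least 1), and the fit is a simple least-squares line on log-log axes, which is only a rough estimator.

```python
import math
from collections import Counter


def size_shares(commit_sizes):
    """Return {size: fraction of all commits that have exactly that size}."""
    counts = Counter(commit_sizes)
    total = len(commit_sizes)
    return {size: n / total for size, n in sorted(counts.items())}


def fit_power_law(shares):
    """Fit share ~ c * size**(-alpha) by least squares on log-log axes.

    `shares` maps commit size (>= 1) to its share of all commits and must
    contain at least two distinct sizes. Returns (alpha, c).
    This is an illustrative estimator, not the paper's procedure.
    """
    xs = [math.log(size) for size in shares]
    ys = [math.log(share) for share in shares.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


if __name__ == "__main__":
    # Hypothetical toy input; real sizes would come from a repository's history.
    toy_sizes = [1, 1, 1, 1, 2, 2, 3, 5, 8, 40]
    alpha, c = fit_power_law(size_shares(toy_sizes))
    print(f"alpha = {alpha:.2f}, c = {c:.2f}")
```

If the shares really follow a power law, the points fall on a roughly straight line on log-log axes, and `alpha` is the steepness of that line.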
What are your thoughts? Are you surprised, or is it obvious to you? What theories are on your mind? Maybe we have the data to validate or invalidate your hypotheses.
5 Replies to “The Dominance of Small Code Contributions”
Your study nicely confirms everything Clay Shirky says about power law distributions in “Here Comes Everybody”. The results are not surprising: they match what is known from other open social systems, such as those Shirky examines in his book, for example the distribution of photos on Flickr tagged “mermaid parade” or “Iraq”.
An important point about power law distributions (as Shirky also notes in the book) is that one-line code change no. 517 may also be the one that closes a vital security hole. So “closed” systems, such as companies which can only hire so many people, may lose out on a lot of vital input and value, even if at a glance only 20% of contributors seem to be doing most of the work.
That’s an important point. It speaks to enabling the wisdom of the crowds, i.e. you want diversity in your innovation and input.
Ralph Johnson commented on Plaxo Pulse but I prefer to discuss it here 🙂
From Ralph: “An interesting question is what percent of the code comes from commits of 100 lines or less. Look only at the last commit that modified a line. Suppose that every line of code in Linux was tagged with the number of lines in the last commit that touched it. What fraction of the lines would have a label less than 10? Less than 100?”
Short answer, or rather more of a question: you are asking not about the percentage of activity that small commits represent but about their share of the total amount of work? I’m pretty sure we can calculate that, but wouldn’t it just be the integral of the commit size distribution? (And then take the 1-100 SLoC range as a percentage of the total?)
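One way to read the difference between the two questions is sketched below, under stated assumptions. The share of commits counts every commit once, while the share of work weights every commit by its size, so it is the sum of size times frequency rather than the plain sum of frequencies. The function names and the input (a plain list of commit sizes in lines) are made up for illustration. Note also that this counts lines changed over the whole history, not lines that survive in the current code base, so it only approximates Ralph's "last commit that touched it" question, which would need per-line history.

```python
def share_of_commits(commit_sizes, max_size=100):
    """Fraction of commits whose size is at most `max_size` lines."""
    small = sum(1 for size in commit_sizes if size <= max_size)
    return small / len(commit_sizes)


def share_of_changed_lines(commit_sizes, max_size=100):
    """Fraction of all changed lines that come from commits of at most
    `max_size` lines. Each commit is weighted by its own size."""
    small_lines = sum(size for size in commit_sizes if size <= max_size)
    return small_lines / sum(commit_sizes)


if __name__ == "__main__":
    # Hypothetical toy input: five small commits and one large one.
    toy_sizes = [1, 1, 1, 1, 96, 900]
    print(share_of_commits(toy_sizes))        # 5 of 6 commits are small
    print(share_of_changed_lines(toy_sizes))  # 100 of 1000 lines come from them
```

In this toy input the small commits are most of the activity but only a tenth of the changed lines, which shows why the two percentages can differ a lot.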