The Dominance of Small Code Contributions

What is the most common size of code contributions to open source? Maybe 30 lines of source code? 200 lines? Or just one line? What’s your guess?

In a recent paper on the commit size distribution in open source, we show that the most common size of code contributions is one line of source code. In our sample of more than 8 million commits, one-line source code commits represent more than 12%, two-line commits represent 9%, and three-line commits represent 5.5% of all commits. The following figure shows this data.

In general, small commits dominate open source. The following figure shows commits of sizes 1–100 source code lines in our sample population. Commits of 1–100 source code lines make up more than 83% of all commits. As one can see, the curve is almost strictly decreasing. In fact, the paper shows that this curve can be closely modeled by a power law. But for that you have to dig into the paper itself.
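
As a rough illustration of what modeling such a curve by a power law means, the sketch below fits f(s) ≈ c · s^(−α) to a size-frequency table using a least-squares line in log-log space. The data is synthetic and `fit_power_law` is a hypothetical helper; neither the fitting method nor the numbers are taken from the paper.

```python
import math


def fit_power_law(sizes, freqs):
    """Fit freq ~ c * size**(-alpha) by least squares on log-log values.

    Returns (alpha, c). Assumes all sizes and freqs are strictly positive.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # In log-log space the slope is -alpha and the intercept is log(c).
    return -slope, math.exp(intercept)


# Synthetic example: frequencies generated from an exact power law
# (alpha = 1.5, c = 0.5 are made-up values, not results from the paper).
sizes = list(range(1, 101))
freqs = [0.5 * s ** -1.5 for s in sizes]

alpha, c = fit_power_law(sizes, freqs)
print(f"alpha = {alpha:.3f}, c = {c:.3f}")
```

A log-log regression is only a quick check. On real, noisy commit data it can give a biased exponent, so a more careful fit would likely use a different estimator.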

What are your thoughts? Are you surprised or is it obvious to you? What theories are on your mind? Maybe we have the data to validate or invalidate your hypotheses.

  1. Morten Blaabjerg

    Your study nicely confirms everything Clay Shirky says about power law distributions in “Here Comes Everybody”. The results are not surprising: they confirm what is known from other open social systems, such as those Shirky looks at in his book, for example the distribution of photos on Flickr tagged “mermaid parade” or “Iraq”.

    An important point about power law distributions (as Shirky also notes in the book) is that one-line code change no. 517 may also be the one that closes a vital security hole. So “closed” systems, such as companies that can only hire so many people, may lose out on a lot of vital input and value, even if at a glance it seems that only 20% of contributors are doing most of the work.

  2. Dirk Riehle Post author

    That’s an important point. It speaks to enabling the wisdom of the crowds, i.e., you want diversity in your innovation and input.

  4. Dirk Riehle Post author

    Ralph Johnson commented on Plaxo Pulse but I prefer to discuss it here 🙂

    From Ralph: “An interesting question is what percent of the code comes from commits of 100 lines or less. Look only at the last commit that modified a line. Suppose that every line of code in Linux was tagged with the number of lines in the last commit that touched it. What fraction of the lines would have a label less than 10? Less than 100?”

    Short answer, well, more of a question: You are asking not about the percentage of activity that small commits represent but about the total amount of work? I’m pretty sure we can calculate that, but wouldn’t it just be the integral of the commit size distribution? (And then take the 1–100 SLoC range as a percentage of the total distribution?)
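
One way to read the calculation discussed above is to weight each commit size by the number of lines it changes, then compare the 1–100 range with the total. The sketch below does this for a made-up histogram; `shares_up_to` and the sample counts are hypothetical and are not data from the study. It counts lines changed by commits, which is not the same as Ralph's per-line labeling by last commit; that would need per-line history, for example from blame data.

```python
def shares_up_to(histogram, limit):
    """Return (commit share, changed-line share) for commits of size <= limit.

    `histogram` maps commit size in lines to the number of commits of that size.
    """
    total_commits = sum(histogram.values())
    total_lines = sum(size * count for size, count in histogram.items())
    small_commits = sum(count for size, count in histogram.items() if size <= limit)
    small_lines = sum(size * count for size, count in histogram.items() if size <= limit)
    return small_commits / total_commits, small_lines / total_lines


# Made-up histogram for illustration only: commit size -> number of commits.
histogram = {1: 120, 2: 90, 3: 55, 50: 20, 1000: 5}

commit_share, line_share = shares_up_to(histogram, 100)
print(f"share of commits with <= 100 lines: {commit_share:.1%}")
print(f"share of changed lines from those commits: {line_share:.1%}")
```

Even in this toy example, small commits account for nearly all commits but a much smaller share of the changed lines, which is why the two questions can have very different answers.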

