We recently looked at the commenting practice of active open source projects. The result is quite impressive: the average comment density of open source is around 19%. (Comment density is the percentage of text that is comments, or, more formally: comment density = comment lines / (comment lines + source code lines); for example, two lines of text, one a comment line and one a source code line, have a 50% comment density.) A 19% comment density is much more documentation than most people expected!
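To make the definition concrete, here is a minimal sketch of the calculation in Python. The function names and the line-counting rule are assumptions made for illustration: only lines that start with a comment prefix count as comments, blank lines are skipped, and block or trailing comments are not handled. It is not the tooling behind the numbers in this post.

```python
def comment_density(comment_lines: int, code_lines: int) -> float:
    """Return comment lines as a fraction of comment plus source code lines."""
    total = comment_lines + code_lines
    if total == 0:
        raise ValueError("need at least one comment or source code line")
    return comment_lines / total


def count_lines(source: str, comment_prefix: str = "#") -> tuple[int, int]:
    """Count (comment_lines, code_lines) using a naive full-line rule.

    Assumption: a line is a comment only if it starts with comment_prefix.
    Blank lines are ignored; block and trailing comments are not detected.
    """
    comments = code = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines count as neither comment nor code
        if stripped.startswith(comment_prefix):
            comments += 1
        else:
            code += 1
    return comments, code


# One comment line and one source code line, as in the example above.
example = "# add two numbers\nresult = 1 + 2\n"
print(comment_density(*count_lines(example)))  # 0.5
```

A real counter needs language-specific rules (for example `/* ... */` blocks in C or Javadoc in Java), so treat the counting part as a placeholder.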
However, such a rough number needs discussion. Here, we look at the comment density on a programming language basis. As it turns out, the comment density of active open source projects varies by programming language. Not surprisingly, Java is leading the bunch.
Below, for six major programming languages, you can see the average comment density as well as its distribution (cast as a histogram).
Figure 1: Comment density distribution of Java
Average: 26% – Stddev: 11% – Population Size: 1085 projects
Figure 2: Comment density distribution of php
Average: 22% – Stddev: 12% – Population Size: 559 projects
Figure 3: Comment density distribution of C and C++
Average: 18% – Stddev: 8% – Population Size: 1621 projects
Average: 16% – Stddev: 9% – Population Size: 276 projects
Figure 5: Comment density distribution of Python
Average: 11% – Stddev: 8% – Population Size: 534 projects
Figure 6: Comment density distribution of Perl
Average: 10% – Stddev: 7% – Population Size: 273 projects
Table 1: Summary of comment densities of popular programming languages
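| # | Language | Average [%] | Stddev [%] | Pop Size [projs] |
|---|---|---|---|---|
| 1 | Java | 26 | 11 | 1085 |
| 2 | php | 22 | 12 | 559 |
| 3 | C and C++ | 18 | 8 | 1621 |
| 4 | (Figure 4) | 16 | 9 | 276 |
| 5 | Python | 11 | 8 | 534 |
| 6 | Perl | 10 | 7 | 273 |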
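As a quick consistency check, the per-language averages reported under Figures 1 to 6 can be combined into a project-weighted average and compared with the overall figure of around 19%. The sketch below assumes the overall number is a per-project average over these same six groups, which the post does not state explicitly. The label "Figure 4" is a placeholder, since that figure's language is not named in the text.

```python
# (label, average %, population size) as reported under Figures 1-6.
# "Figure 4" is a placeholder label; its language is not named in the text.
stats = [
    ("Java", 26, 1085),
    ("php", 22, 559),
    ("C and C++", 18, 1621),
    ("Figure 4", 16, 276),
    ("Python", 11, 534),
    ("Perl", 10, 273),
]

total_projects = sum(size for _, _, size in stats)

# Weight each group's average by the number of projects in that group.
weighted_average = sum(avg * size for _, avg, size in stats) / total_projects

print(total_projects)              # 4348
print(round(weighted_average, 1))  # 19.0
```

Under that assumption the rounded per-language numbers are consistent with the overall 19%.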
What does this data mean? Here are a few hypotheses, and I would very much like to hear what you think is going on.
- Java is leading the bunch, but that’s only because of all the comments auto-generated by IDEs like Eclipse (JDT);
- Java and C/C++ (and all C-style languages) need more comments because it is harder to express the programming intent;
- Dynamic languages like Python, php, or Perl are more expressive and therefore need fewer comments than static programming languages;
- php is different from the rest of the dynamic programming languages (why?) and you can’t group them anyway.
These are serious hypotheses, even if they may be hard to validate. How would you go about validating them? What other thoughts or explanations for the data do you have? What do you think we should look at?
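One rough starting point, sketched here under stated assumptions, is to ask whether two languages differ by more than chance would explain, using only the summary figures above. The helper `welch_t` is a hypothetical name. It computes Welch's t statistic from means, standard deviations, and population sizes, and it treats the rounded percentages as exact and the projects as independent samples.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic for two groups given only mean, stddev, and size."""
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_a - mean_b) / standard_error


# php (22%, stddev 12%, 559 projects) vs Python (11%, stddev 8%, 534 projects)
t_php_python = welch_t(22, 12, 559, 11, 8, 534)
print(round(t_php_python, 1))
```

A large statistic only says that the two averages are unlikely to be equal. It says nothing about why they differ, so telling the hypotheses apart would still need per-project data, for example counts of auto-generated comments.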
Thanks once more for your thoughts!
Data was generously provided by Ohloh.net.