We recently looked at the commenting practice of active open source projects. It is quite impressive: the average comment density of open source is around 19%. (Comment density is the percentage of text that is comments, or, more formally: comment density = comment lines / (comment lines + source code lines); for example, two lines of text, one a comment line and one a source code line, have a 50% comment density.) A 19% comment density is much more documentation than most people thought!
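To make the definition concrete, here is a minimal sketch in Python. The function names, the `#` comment prefix, and the choice to skip blank lines are assumptions made for illustration; this is not how the actual counting tool works, and it ignores trailing and multi-line comments.

```python
def comment_density(comment_lines: int, code_lines: int) -> float:
    """Return comment lines as a fraction of comment plus source code lines."""
    total = comment_lines + code_lines
    if total == 0:
        raise ValueError("need at least one comment or source code line")
    return comment_lines / total


def count_lines(source: str, comment_prefix: str = "#") -> tuple[int, int]:
    """Count (comment_lines, code_lines) in a piece of source text.

    Simplifications (assumptions of this sketch): blank lines are skipped,
    and a line is a comment only if it starts with the prefix.
    """
    comment_lines = 0
    code_lines = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith(comment_prefix):
            comment_lines += 1
        else:
            code_lines += 1
    return comment_lines, code_lines


# The example from the text: one comment line and one source code line.
example = "# add two numbers\nresult = 1 + 2\n"
print(comment_density(*count_lines(example)))  # 0.5
```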
However, such a rough number needs discussion. Here, we look at the comment density on a programming language basis. As it turns out, the comment density of active open source projects varies by programming language. Not surprisingly, Java is leading the bunch.
Below, for six major programming languages, you can see the average comment density as well as its distribution (cast as a histogram).
Figure 1: Comment density distribution of Java
Average: 26% – Stddev: 11% – Population Size: 1085 projects
Figure 2: Comment density distribution of php
Average: 22% – Stddev: 12% – Population Size: 559 projects
Figure 3: Comment density distribution of C and C++
Average: 18% – Stddev: 8% – Population Size: 1621 projects
Average: 16% – Stddev: 9% – Population Size: 276 projects
Figure 5: Comment density distribution of Python
Average: 11% – Stddev: 8% – Population Size: 534 projects
Figure 6: Comment density distribution of Perl
Average: 10% – Stddev: 7% – Population Size: 273 projects
Table 1: Summary of comment densities of popular programming languages
| # | Language | Average [%] | Stddev [%] | Pop Size [projects] |
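As a rough sanity check, the six per-language averages can be combined into a project-weighted mean and compared with the overall 19%. The sketch below uses only the rounded figures quoted under the histograms. It assumes the overall number is a per-project average, which the post does not state, and the overall number may also cover languages not shown here.

```python
# (average comment density in %, number of projects), copied from the
# figure captions above. The fourth entry is the group whose language
# name is not given in the text.
per_language = [
    (26, 1085),  # Java
    (22, 559),   # php
    (18, 1621),  # C and C++
    (16, 276),   # language not named in the text
    (11, 534),   # Python
    (10, 273),   # Perl
]

total_projects = sum(count for _, count in per_language)
weighted_mean = sum(avg * count for avg, count in per_language) / total_projects

print(total_projects)           # 4348
print(round(weighted_mean, 1))  # 19.0
```

Under those assumptions the weighted mean lands at about 19%, which is consistent with the headline figure. Because the inputs are rounded, this is a plausibility check only.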
What does this data mean? Here are a couple of hypotheses, and I very much would like to hear your opinion on what you think is going on.
- Java is leading the bunch, but that’s only because of all the auto-generated comments in IDEs like the JDT (Eclipse);
- Java and C/C++ (and all C-style languages) need more comments because it is harder to express the programming intent;
- Dynamic languages like Python, php, or Perl are more expressive and therefore need fewer comments than static programming languages;
- php is different from the rest of the dynamic programming languages (why?) and you can’t group them anyway.
These are serious hypotheses, even if they may be hard to validate. How would you go about validating them? What other thoughts or explanations for the data do you have? What do you think we should look at?
Thanks once more for your thoughts!
Data was generously provided by Ohloh.net.
7 Replies to “How Open Source Comments (by Programming Language)”
Hi Dirk, some of my thoughts.
Java, C++ (and Python) are often used to implement larger programs (in code size and functionality) that require higher quality through more elaborate documentation. I guess that comment density correlates with project size.
Did you remove empty comment lines that constantly occur in Java comments?
Perhaps Perl was mainly employed to solve smaller tasks (in your test data). Concerning expressiveness of code: I frequently experienced that Perl scripts of experienced developers are hardly readable, contain a descriptive comment in the header of each file only, and that there is a common bad habit not to comment anything in the code.
This is of course a *very* coarse characterization, and not a bit subjective, but it’s the best I can do.
@Martin: Good questions! We have data for some of them. We expected comment density to go down with project size; however, the two seem completely unrelated. More precisely, we looked at the correlation between (aggregate) comment density and project size as well as team size: no correlation whatsoever. I take this to mean that open source projects are pretty good at maintaining a commenting discipline.
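For anyone who wants to try this kind of check on their own data, here is a minimal sketch. The reply above does not say which correlation measure was used; Pearson's r is assumed here as one common choice, and the sample numbers are invented purely for illustration (they are not Ohloh data).

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical projects: size in lines of code and comment density in %.
# These values are made up for the example.
project_sizes = [12_000, 85_000, 3_500, 240_000, 51_000]
densities = [21.0, 17.5, 19.0, 20.5, 16.0]

r = pearson(project_sizes, densities)
print(f"r = {r:.2f}")  # values near 0 suggest no linear relationship
```

Project sizes tend to be heavily skewed, so a rank-based measure or a log transform of the size might be a better fit in practice.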
Our differ/parser is ohcount and while it recognizes different languages and can handle multi-line comments, it doesn’t filter out empty comment lines. This is a good idea we should implement!
I think you are correct in that we shouldn’t try to group programming languages that shouldn’t be grouped. So maybe dynamic languages aren’t really a group. (I think that C-style languages still probably are.)
@Guido: Thanks! This sounds well worth exploring (and also quite hard to do). We would have to classify all languages by “how many gotchas they have” as well as how broadly they have been adopted and then correlate that with comment density?
What would you expect to see? Few gotchas, top notch programmers -> low comment density? Lots of gotchas, broad adoption -> high comment density? The underlying argument would be that code that is hard to write for a programmer will lead to more comments?
In terms of Java, it would be interesting to run the comments through a filter that would do the following:
* Remove empty comments (as Martin suggests above).
* Remove “boilerplate” comments: copyright notices, etc., that are the same across all of the files.
* Remove class and method comments that appear to be auto-generated, i.e., those that have only the parameter names without any additional explanation.
* Remove (but also count!) the “TODO” and “HACK” and “FIXME” types of comments. It’d be very interesting to see how many of those there are.
I’ve seen lots of Java projects (closed- and open-source) that appear, at first, to have lots of comments, but then reading them I find a lot of boilerplate and auto-generated comments.
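A filter along these lines could be sketched roughly as follows. This is an illustration under loud assumptions, not an existing tool: the regular expressions and category names are made up for the example, comment markers inside string literals are not handled, and "boilerplate" is approximated by a keyword match rather than by comparing comments across files.

```python
import re
from collections import Counter

# Matches /* ... */ blocks (including Javadoc) and // line comments.
# Simplification: comment markers inside string literals are not handled.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
MARKER_RE = re.compile(r"\b(TODO|FIXME|HACK)\b")
# A Javadoc tag with no explanation, e.g. "@param name" or "@return".
BARE_TAG_RE = re.compile(r"^@(param\s+\w+|return|throws\s+[\w.]+)$")
# Keyword stand-in for "same across all files" (an assumption).
BOILERPLATE_RE = re.compile(r"copyright|licen[cs]e", re.IGNORECASE)


def comment_lines(java_source: str) -> list[str]:
    """Return every comment line with its comment delimiters stripped."""
    lines = []
    for match in COMMENT_RE.finditer(java_source):
        for raw in match.group().splitlines():
            text = raw.strip()
            text = re.sub(r"^(/\*+|//+|\*+(?!/))", "", text)  # leading /*, //, *
            text = re.sub(r"\*+/$", "", text)                 # trailing */
            lines.append(text.strip())
    return lines


def classify(line: str) -> str:
    """Assign one stripped comment line to a category."""
    if not line:
        return "empty"  # includes lines holding only /*, * or */
    if MARKER_RE.search(line):
        return "marker"
    if BARE_TAG_RE.match(line):
        return "bare_tag"
    if BOILERPLATE_RE.search(line):
        return "boilerplate"
    return "kept"


def summarize(java_source: str) -> Counter:
    """Count comment lines per category for one source file."""
    return Counter(classify(line) for line in comment_lines(java_source))


sample = """\
/*
 * Copyright (c) Example Project
 */
/**
 *
 * @param name
 * @return
 */
String greet(String name) {
    // TODO: handle null
    return "Hello " + name; // build the greeting
}
"""
counts = summarize(sample)
print(dict(counts))
```

In this made-up sample only one of the ten comment lines ends up in the "kept" category, which is the sort of gap between apparent and real documentation described above.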
@Ted Young: Thanks! Yes, I agree that we can make the data sharper for each language. Just have to enhance the differ/parser 🙂
A good explanation, but you should also look at other languages like Pascal, Smalltalk, etc.