How open source comments (by programming language)

We recently looked at the commenting practices of actively developed open source projects. The result is quite impressive: the average comment density of open source is around 19%. (Comment density is the percentage of text that is comments, or, more formally: comment density = comment lines / (comment lines + source code lines); for example, two lines of text, one a comment line and one a source code line, have a 50% comment density.) A 19% comment density is much more documentation than most people thought!
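
The definition above can be written down as a minimal Python sketch. The function name `comment_density` and the choice to reject an empty input are assumptions made for illustration; this is not how ohcount or Ohloh compute the number.

```python
def comment_density(comment_lines: int, code_lines: int) -> float:
    """Return comment lines / (comment lines + source code lines)."""
    total = comment_lines + code_lines
    if total == 0:
        # Assumption: an empty file has no meaningful density.
        raise ValueError("no comment or source code lines to measure")
    return comment_lines / total


# The example from the text: one comment line and one source code line.
print(f"{comment_density(1, 1):.0%}")  # prints 50%
```

Blank lines are counted in neither the numerator nor the denominator here, which matches the formula as stated.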

However, such a rough number needs discussion. Here, we look at the comment density on a programming language basis. As it turns out, the comment density of active open source projects varies by programming language. Not surprisingly, Java is leading the bunch.

Below, for six major programming languages, you can see the average comment density as well as its distribution (cast as a histogram).

Figure 1: Comment density distribution of Java

Average: 26% – Stddev: 11% – Population Size: 1085 projects

Figure 2: Comment density distribution of PHP

Average: 22% – Stddev: 12% – Population Size: 559 projects

Figure 3: Comment density distribution of C and C++

Average: 18% – Stddev: 8% – Population Size: 1621 projects

Figure 4: Comment density distribution of JavaScript

Average: 16% – Stddev: 9% – Population Size: 276 projects

Figure 5: Comment density distribution of Python

Average: 11% – Stddev: 8% – Population Size: 534 projects

Figure 6: Comment density distribution of Perl

Average: 10% – Stddev: 7% – Population Size: 273 projects

Table 1: Summary of comment densities of popular programming languages

#	Language	Average [%]	Stddev [%]	Population size [projects]
1	Java	26	11	1085
2	PHP	22	12	559
3	C/C++	18	8	1621
4	JavaScript	16	9	276
5	Python	11	8	534
6	Perl	10	7	273
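
As a rough cross-check of Table 1, the per-language averages can be combined into one project-weighted mean. The sketch below is not part of the original analysis: it assumes each language average is a mean over that language's projects, and it covers only the six languages listed, whereas the overall figure quoted earlier covers all of open source. The names `TABLE_1` and `weighted_mean_density` are made up for illustration.

```python
# (language, average comment density in %, number of projects), copied from Table 1.
TABLE_1 = [
    ("Java", 26, 1085),
    ("PHP", 22, 559),
    ("C/C++", 18, 1621),
    ("JavaScript", 16, 276),
    ("Python", 11, 534),
    ("Perl", 10, 273),
]


def weighted_mean_density(rows):
    """Average the per-language densities, weighting each by its project count."""
    total_projects = sum(projects for _, _, projects in rows)
    weighted_sum = sum(average * projects for _, average, projects in rows)
    return weighted_sum / total_projects


print(round(weighted_mean_density(TABLE_1), 1))  # prints 19.0
```

Weighting by project count rather than by lines of code is a design choice; a line-weighted mean would need per-project sizes, which the table does not give.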

What does this data mean? Here are a few hypotheses, and I would very much like to hear what you think is going on.

Java is leading the bunch, but only because of all the comments auto-generated by IDEs like Eclipse (JDT);
Java and C/C++ (and all C-style languages) need more comments because it is harder to express the programming intent in them;
Dynamic languages like Python, PHP, or Perl are more expressive and therefore need fewer comments than static programming languages;
PHP is different from the rest of the dynamic programming languages (why?) and you can’t group them anyway.

These are serious hypotheses, even if they may be hard to validate. How would you go about validating them? What other thoughts or explanations for the data do you have? What do you think we should look at?

Thanks once more for your thoughts!

Data was generously provided by Ohloh.net.

Posted on 2008-11-10

1. Software Industry, 1.2 Open Source (Industry)

Comments

firman

2009-10-12

A good explanation, but you should also look at other languages such as Pascal, Smalltalk, etc.
Thanks.

Dirk Riehle

2008-11-18

@Ted Young: Thanks! Yes, I agree that we can make the data sharper for each language. Just have to enhance the differ/parser 🙂

Ted Young

2008-11-17

In terms of Java, it would be interesting to run the comments through a filter that would do the following:
* Remove empty comments (as Martin suggests above).
* Remove “boilerplate” comments: copyright notices, etc., that are the same across all of the files.
* Remove class and method comments that appear to be auto-generated, i.e., those that have only the parameter names without any additional explanation.
* Remove (but also count!) the “TODO” and “HACK” and “FIXME” types of comments. It’d be very interesting to see how many of those there are.
I’ve seen lots of Java projects (closed- and open-source) that appear, at first, to have lots of comments, but then reading them I find a lot of boilerplate and auto-generated comments.

Dirk Riehle

2008-11-12

@Guido: Thanks! This sounds well worth exploring (and also quite hard to do). We would have to classify all languages by “how many gotchas they have” as well as how broadly they have been adopted and then correlate that with comment density?
What would you expect to see? Few gotchas, top notch programmers -> low comment density? Lots of gotchas, broad adoption -> high comment density? The underlying argument would be that code that is hard to write for a programmer will lead to more comments?

Dirk Riehle

2008-11-12

@Martin: Good questions! For some of them we actually have data. We expected comment density to go down with project size; however, the two seem completely unrelated. More precisely, we looked at the correlation between (aggregate) comment density and project size as well as team size, and found no correlation whatsoever. I take this to mean that open source projects are pretty good at maintaining a commenting discipline.
Our differ/parser is ohcount and while it recognizes different languages and can handle multi-line comments, it doesn’t filter out empty comment lines. This is a good idea we should implement!
I think you are correct that we shouldn’t try to group programming languages that shouldn’t be grouped. So maybe dynamic languages aren’t really a group. (I think that C-style languages still probably are.)

Guido van Rossum

2008-11-11

Could it have to do with a combination of how many “gotchas” there are in a language vs. the competence level of the typical developer in that language? C/C++ have a high level of gotchas but as a result only the best programmers write in these languages. Java has fewer gotchas but also a much larger pool of not especially stellar programmers. PHP has many gotchas and is for the masses. JavaScript is a bit of an outlier; this may be skewed because most JS projects are small. Python is a fairly elite language (there was a review a few years ago indicating Python programmers earn more on average). Perl is a bit waning (see the TIOBE index) and is probably now mostly used by long-time users who are *very* competent in it (or believe they are).
This is of course a *very* coarse characterization, and more than a bit subjective, but it’s the best I can do.

Martin

2008-11-11

Hi Dirk, some of my thoughts.
Java, C++ (and Python) are often used to implement larger programs (in code size and functionality), which thus require higher quality by means of more elaborate documentation. I guess that comment density correlates with project size.
Did you remove empty comment lines that constantly occur in Java comments?
Perhaps Perl was mainly employed to solve smaller tasks (in your test data). Concerning expressiveness of code: I have frequently found that Perl scripts by experienced developers are hardly readable and contain a descriptive comment only in the header of each file, and that there is a common bad habit of not commenting anything in the code.
Finally, my two cents on PHP and JavaScript: to my mind, one should set both languages apart from the rest, as they are used for different purposes. So I would not expect them to be comparable with the other languages.
Best!
