The Sweet Spot of Code Commenting in Open Source

In a large-scale study of actively developed open source projects, we found an average comment density of about 20% (one comment line per five lines of code). Given that much of open source remains volunteer work, we believe that a comment density of 20% represents the sweet spot of code commenting in open source projects: neither are you over-documenting your code and hence wasting resources, nor are you under-documenting it and thereby endangering your project.

This statement rests on the argument that programmers document only as much as necessary, and no more. Since open source remains merit-driven, usually no single person can order you to document more (or less), and a comment density of 20% on average seems to be where most projects end up naturally. The analysis below further supports this argument: it shows that comment density is largely independent of other variables like project or team size and decreases only slowly as a project matures.

Comment Density

Figure 1 gives the overall picture. Each data point (dot) represents one project and its comment density. Across all 5,000 projects, the comment density averages about 18.7%.
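The post does not spell out the metric, but the usual definition (and, as I understand it, the one used in the accompanying paper) is comment lines divided by comment lines plus source code lines, with blank lines excluded. A minimal sketch of that computation, assuming whole-line comments only; the actual study used the ohcount parser, which handles far more cases:

```python
# Sketch: comment density of a source file, assuming the definition
# comment_lines / (comment_lines + code_lines), blank lines excluded.
# Only whole-line comments with a few common prefixes are recognized.

COMMENT_PREFIXES = ("#", "//", "/*", "*", "--")

def comment_density(lines):
    comment_lines = 0
    code_lines = 0
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are not counted at all
        if stripped.startswith(COMMENT_PREFIXES):
            comment_lines += 1
        else:
            code_lines += 1
    total = comment_lines + code_lines
    return comment_lines / total if total else 0.0

sample = [
    "# compute the answer",
    "x = 6 * 7",
    "",
    "# print it",
    "print(x)",
]
print(comment_density(sample))  # 2 comments / 4 counted lines = 0.5
```

Note that a density of 20% under this definition means one of every five counted lines is a comment, matching the post's parenthetical.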

Figure 1

Figure 2 shows comment density by commit size (the size of a code contribution). As can be seen, smaller commits have a higher-than-average comment density. In particular, a commit of one source code line comes with two comment lines, and a commit of two source code lines also comes with two comment lines.

Figure 2

Figure 3 shows how comment density varies with team size (number of committers). As can be seen, it does not vary: only the fluctuation increases with larger team sizes, of which there are fewer and fewer. Comment density and team size are not correlated at all! The same applies to project size (not shown).
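For readers curious what "not correlated at all" means operationally, here is a sketch of the kind of check one could run: Pearson's correlation coefficient over (team size, comment density) pairs. The data points below are made up for illustration and are not from the study:

```python
# Illustrative only (not the study's code): Pearson's r between team
# size and comment density. A value near zero is what "no correlation"
# looks like; the project data below is invented for the example.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

team_sizes = [1, 2, 3, 5, 8, 13, 21]                 # hypothetical projects
densities = [0.21, 0.17, 0.19, 0.22, 0.18, 0.20, 0.19]
print(round(pearson_r(team_sizes, densities), 3))
```

An r near zero, which is what Figure 3 suggests, indicates no linear relationship between team size and comment density.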

Figure 3

Finally, Figure 4 shows how comment density varies with project age. Here there is a correlation: comment density goes down as a project gets older. Basically, it appears that developers document less as a project matures. Please note, however, that the actual decline in comment density, while statistically significant, is quite small and may not matter in the grand scheme of things.

Figure 4

Please refer to the accompanying four-page research paper for more details; I understand that much more needs to be said. You may also like the prior post on how open source projects comment, broken down by programming language.

The data used in this analysis was provided by Ohloh.net.

Science Soapbox

We had originally submitted a much more detailed analysis to ICSE 2009. Unfortunately, the submission was rejected, and even the referenced four-pager almost didn’t make it. This is because, as one reviewer put it, “this work has no merit, because it only measures [open source projects].” This comment still annoys me because of its warped understanding of science. What are we supposed to do, if not measure the real world? Freely fantasize about how software development works? Next time you hear some wild claims about how software development supposedly works, please ask the presenter what data or studies they base their work on.

14 thoughts on “The Sweet Spot of Code Commenting in Open Source”

  1. Gabriel Burt

    Some projects document APIs inline, while others do not – they use external tools and reserve inline documentation for implementation info. This is probably a distinction that should be made in your analysis.

  2. Jack Repenning

    I take your point, in the Soapbox section, that analysis solely of FOSS code is still analysis, and has merit. Still, it would definitely be interesting to have similar analysis for commercially developed code–and of course, the comparison would instantly spawn questions about the differences.

    For example, I just did a quick scan of some of our code, and found comments and blank lines close to 50%. That’s a big spread, compared to your numbers, and all the more surprising because my company has very deep FOSS roots, and most of our developers and culture are very FOSS-like.

  3. Dirk Riehle Post author

    @Gabriel Burt: Thanks for pointing this out. I agree that projects have or should have documentation outside the actual code. Here, we really just look at the code documentation. (That outside documentation has the nasty tendency to get out-of-date, which is why most people I know approach it with some mistrust.)

    There are a couple of aspects that we could focus on to make the numbers more precise. I think programming language is the most important one. Java, for example, has all these auto-generated comment stubs which blow up numbers and are out of line with other languages.

    In general, even if a specific number has some error margin, you can still get a good indicator of where your project might want to be if you only use the same measure.

  4. Dirk Riehle Post author

    @Jack Repenning: We are doing exactly that: preparing a large-scale comparison of open source with closed source. But of course closed source code remains hard to come by 🙂

    What’s the dominant programming language in your code? Our analysis does not include empty lines, only comment lines and source code lines.

    As to the reasons why the numbers might differ: that would be future research! The most common hypothesis is probably that public scrutiny is the most important driver of hacker behavior.

  5. Pingback: Labnotes » Rounded Corners 224 – Broken gets fixed, shoddy lasts forever

  6. AA

    Did you consider sending your work to MSR? I guess they’d value your paper more. The deadline has been extended to the beginning of March.

  7. Pingback: 19 pour cent de commentaires | Sur la route d'Oxiane

  8. Pingback: Schwinl! » Blog Archive » Dix-neuf pourcents de commentaires…

  9. Pingback: The Shape of Code » Using third party measurement data

  10. Philippe Ombredanne

    Dirk:
    Very nice!
    I am glad you make no claim about the value of commenting, and getting some indication of it could be an interesting future contribution.

    I am always appalled when I read code — and I read a lot of it — by the amount of useless, crappy, and often auto-generated comments that sneak into the code. IDEs are often to blame for that, together with poor developer education on the value of comments.
    I much prefer well-written, readable code with no comments to a page of comment-junk-laced code.

    I wonder how many of these comments really mean something, i.e., add something of value to code comprehension.
    One thing to consider in future studies could be trying to distinguish potentially valuable comments from useless ones, such as copyright/license notices, changelogs, and comment junk. For the junk part, the length of structured comments (i.e., javadoc, perldoc, pydoc, phpdoc, etc.) may be useful: it is often very short and prefixed by doc tags.
    My hunch: once you have discounted the boilerplate and junk, there is not much left.
    Is that a problem? Most likely not.

  11. Dirk Riehle Post author

    Hey Philippe, thanks for commenting!

    Yep, no value judgement or even attempted explanation of different comment densities. Without further work, this would only get us into hot water…

    Naturally, we all want quality comments, not boilerplate auto-generated ones. At present, “our” parser (which is ohcount, to be precise) does not do a semantic analysis of the content of comments. It would be another good step to add that information.

    Personally, I don’t think that after more filtering we’ll end up with only junk. Why? We have some indications: for example, commits of 1 SLoC typically come with 2 comment lines, and these should be comment lines with real content. From other work we know that more than 10% of all commits are 1-SLoC commits… so there is good commenting going on in open source.

    What I’m really interested in, to be determined in future work, is of course how much commenting is needed to keep a project thriving, for example, how much it helps to get new people on board.

  12. Mike

    I was looking for tools to help non-engineers or code project managers spot open source code usage in proprietary code projects. This page has really been helpful for ideas (analyzing comment grammar, comment density, or comment locations might be one high-level triage method). If anyone has any other ideas for simple parsing tools that might be able to identify proprietary code versus open source code within a given project, I’d be very grateful.

  13. Dirk Riehle Post author

    Hi Mike, glad you liked the work, much more to come!

    I think simple open-source tools for this are coming, but they are a far cry from what you would want professionally.

    May I ask: Are you worried about mixing open with closed source for reasons of protecting your intellectual property?

    A commercial solution to that problem is offered by Black Duck Software, but of course you have to pay. Their tools can recognize open source code in your closed projects. Is that your need? (I’m honestly curious about that market.) Thanks!

