Rank Your Git Contributors by Lines of Code (Then Be Careful What You Do With the Data)

Lately at work there’s been a bit of an obsession with metrics – I’m not sure it’s healthy. Code metrics can be interesting, but in the wrong hands or used the wrong way, they are misleading and harmful. It remains to be seen what the outcome will be in this case. I’m more than a little worried.

That said, even I like to get some numbers on code, especially unfamiliar code, to help me figure out a few things. For example, I’ve been using CLOC (Count Lines of Code) for years to find out quickly how big a project is and what languages are used in what proportion.
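
If you haven’t seen it, the invocation couldn’t be much simpler; run it from the root of a checkout and it prints a per-language table of file, blank, comment, and code line counts:

$ cloc .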

The first metric that’s started to be tracked at work is lines of code added and deleted per release, per author. I’ve occasionally glanced at these kinds of numbers in the past too, and even wrote a little script (git-rank) to get the same kind of info when I was on a code deletion spree. I’ve found these numbers to be useful for getting a general idea of who is most knowledgeable about a project or some files within the project.
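
You don’t strictly need a separate tool for the basic totals, either; stock git plus a little awk gets most of the way there. Here’s a rough sketch (using the same tag range as the examples below) – this is not what git-rank does internally, just the general idea:

$ git log v1.1.1..v1.1.9 --numstat --format='>%aN' |
    awk '/^>/            { author = substr($0, 2); next }
         $1 ~ /^[0-9]+$/ { changed[author] += $1 + $2 }
         END             { for (a in changed) print changed[a], a }' |
    sort -n

Each --numstat line is “added &lt;TAB&gt; deleted &lt;TAB&gt; path”, so this just sums both columns under whichever author line was seen last; binary files show up as “-” and get skipped by the numeric check.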

However, the numbers can easily be misleading. If someone is committing a lot of style reformats, vendoring code from other projects, or autogenerating documentation, their numbers will be very high but they may know very little about the codebase. For example, for one release of a Rails project I’ve worked on, the metrics gathered showed that I had added 248,475 lines and deleted 458,020 lines. That’s obviously suspicious, although it does make me look very active. These are the kinds of numbers you should dig into a little to figure out what’s happening, which is why my little git-rank script makes it easy to break those line totals down by file and then exclude files from the next count.

$ git rank v1.1.1..v1.1.9 --all-authors-breakdown --author "Matt Robinson"
Matt Robinson         603061
                      1 vendor/gems/json_pure-1.5.1/tests/fixtures/fail14.json
                      1 vendor/gems/json_pure-1.5.1/tests/fixtures/fail12.json
                      1 vendor/rails/railties/test/vendor/gems/dummy-gem-a-0.4.0/lib/dummy-gem-a.rb
                      ........
                      21418 vendor/gems/haml-3.0.13/lib/haml/precompiler.rbc
                      24953 vendor/gems/haml-3.0.13/test/sass/engine_test.rbc
                      30038 vendor/gems/haml-3.0.13/test/haml/engine_test.rbc
Matt Robinson         603061

It’s pretty obvious that most of what I did was update vendored gems. If you’re trying to figure out who did the most work between the releases I specified, you probably want to ignore the vendor directory when counting lines of code.

$ git rank v1.1.1..v1.1.9 --exclude-file vendor
Josh Cooper           4
Nigel Kersten         8
Andreas Zuber         11
Michael Stahnke       11
Jacob Helwig          15
nfagerlund            37
Daniel Pittman        283
Max Martin            418
Nick Lewis            543
Randall Hansen        901
Pieter van de Bruggen 1018
Matt Robinson         1661

Now we’re starting to get a more realistic picture. From here we could dig deeper by listing all the files again for all the authors, then excluding lines based on a regex if there were automated changes, and then looking at additions vs deletions separately (I’m summing them together for these numbers).
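
None of that needs anything fancier than the same git-and-awk approach. Git’s pathspec magic can handle the vendor exclusion, and keeping the two numstat columns separate gives you additions versus deletions; a rough sketch for one author (reusing my name and the same tag range from above):

$ git log v1.1.1..v1.1.9 --author='Matt Robinson' --numstat --format= \
      -- . ':(exclude)vendor' |
    awk '$1 ~ /^[0-9]+$/ { added += $1; deleted += $2 }
         END             { printf "added: %d  deleted: %d\n", added, deleted }'

Dropping the --author filter and accumulating per path (the third numstat column) would get you the per-file listing for everyone; excluding individual lines by regex means parsing the actual diffs, which is where this stops being a one-liner.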

However, I hope people never think they can get the whole picture, or even most of the picture, of a codebase and who is contributing the most from something as arbitrary as lines-of-code counts. I feel like that’s almost too obvious to say, but I’ve heard horror stories of managers who tie reviews to metrics like this. I hope I never personally experience such a thing.

P.S. If anyone tries using the ‘git-rank’ project I mentioned above, please keep in mind it’s a hacky little side project that probably has some bugs, and you should take the numbers it spits out with a grain of salt. What a weird phrase that is.