Is This Code Written in English? A Study of the Natural Language of Comments and Identifiers in Practice – Timo Pawelka

Look at this method:


This is hard to read if you don’t know Japanese!


This is the problem that this paper addresses. The goal is to have a tool that can determine what percentage of the code is written in a certain (natural) language.
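To make "what percentage of the code is written in a certain language" concrete, here is a minimal sketch. It assumes every comment or identifier has already been given a language label and that each item counts equally; the talk notes do not say how the paper weighs items, and the name `language_shares` is made up for illustration.

```python
from collections import Counter


def language_shares(labels):
    """Return the percentage of items per language label.

    `labels` holds one language label per comment or identifier,
    e.g. "en", "nl" or "unknown". Counting each item equally is an assumption.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: 100.0 * n / total for lang, n in counts.items()}


if __name__ == "__main__":
    print(language_shares(["en", "en", "nl", "unknown"]))
```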

This problem is quite hard to solve, as comments and identifiers are short: the majority of comments are under 100 characters and the majority of identifiers under 30. They also contain abbreviations and often mix languages, like getKlant (English "get" plus Dutch "klant", meaning customer).
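A mixed identifier like getKlant only becomes analysable once it is cut into words. The sketch below is one common way to do that, splitting on underscores and camelCase boundaries. It is not taken from the paper: `split_identifier` is a hypothetical helper, and it assumes ASCII identifiers.

```python
import re

# Matches, in order: an acronym followed by a capitalised word (HTTP in HTTPResponse),
# a capitalised or lowercase word, a trailing acronym, or a run of digits.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(identifier):
    """Split an ASCII identifier into lowercase word tokens."""
    tokens = []
    # First cut on underscores and other non-word characters, then on case changes.
    for part in re.split(r"[_\W]+", identifier):
        tokens.extend(_WORD.findall(part))
    return [token.lower() for token in tokens]


if __name__ == "__main__":
    print(split_identifier("getKlant"))
    print(split_identifier("parseHTTPResponse"))
```

Each token can then be checked against a language on its own, which is what makes the "get" versus "klant" mix visible at all.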

The basic approach works by using several algorithms and having them 'vote' on the language.
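The voting idea can be sketched as follows. The three detectors are toy stand-ins invented for this example (tiny stop-word lists, a kana check, a few letter pairs), not the algorithms from the paper, and the tie and `min_votes` rules are assumptions as well.

```python
from collections import Counter

# Toy data, made up for this sketch.
STOPWORDS = {
    "en": {"the", "is", "of", "and", "this", "to"},
    "nl": {"de", "het", "een", "van", "en", "dit"},
}
LETTER_PAIRS = {"en": ("th", "ng", "wh"), "nl": ("ij", "aa", "sch")}


def _best_or_none(scores):
    """Return the top-scoring language, or None on zero evidence or a tie."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if ranked[0][1] == 0:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def stopword_detector(text):
    words = set(text.lower().split())
    return _best_or_none({lang: len(words & sw) for lang, sw in STOPWORDS.items()})


def kana_detector(text):
    # Hiragana and katakana live in U+3040..U+30FF.
    return "ja" if any("\u3040" <= ch <= "\u30ff" for ch in text) else None


def letter_pair_detector(text):
    lowered = text.lower()
    return _best_or_none(
        {lang: sum(lowered.count(p) for p in pairs) for lang, pairs in LETTER_PAIRS.items()}
    )


DETECTORS = [stopword_detector, kana_detector, letter_pair_detector]


def vote(text, detectors=DETECTORS, min_votes=2):
    """Let each detector cast one vote; abstentions (None) are ignored."""
    votes = Counter(guess for guess in (d(text) for d in detectors) if guess is not None)
    if not votes:
        return "unknown"
    ranked = votes.most_common(2)
    best, n = ranked[0]
    if n < min_votes or (len(ranked) > 1 and ranked[1][1] == n):
        return "unknown"
    return best


if __name__ == "__main__":
    print(vote("this is the name of the customer"))
    print(vote("dit is de naam van de klant"))
```

Falling back to "unknown" when the detectors disagree or too few of them commit trades recall for precision, which seems a sensible default for very short texts.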


Using two different approaches (one for comments and one for identifiers), very high precision and recall were achieved (I missed the slide, but both were above 90%).

So why would we care?

With this tool, we can now figure out whether this is a problem that actually occurs! (Nice way to use your research to prove its usefulness.)

Timo studied whether there are a lot of projects that are multi-language, and also studied the difference between industry and open source code.

Open source systems largely seem to be in English or unknown languages:


Comparing this to industry systems, the difference is really large:


Considerably more non-English identifiers and comments are found in industry projects.

Is this good or bad? Of course, recruiting developers for a code base that mixes natural languages might be a challenge, but the identifiers might be closer to the domain. This is an open question.


A closing remark from Timo about researching code bases in general: taking industry systems into account is needed, as they might be very different!

The paper is available as a preprint.