Look at this method:
This is the problem that this paper addresses. The goal is to have a tool that can determine what percentage of the code is written in a certain (natural) language.
This problem is quite hard to solve, as comments and identifiers are short: the majority of comments are under 100 characters, and identifiers under 30. They also contain abbreviations and often mix languages, as in getKlant (klant is Dutch for ‘customer’).
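Mixed-language identifiers like getKlant can only be classified per word, so they first have to be split into tokens. A minimal sketch of such a splitter (the function name and exact rules are my own, not from the talk) might look like:

```python
import re

def split_identifier(name: str) -> list[str]:
    """Split a camelCase or snake_case identifier into lowercase word tokens."""
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", name)
    return [p.lower() for p in parts if p]

# A mixed-language identifier yields one English and one Dutch token:
split_identifier("getKlant")  # ['get', 'klant']
```

Each token can then be classified on its own, so one identifier can count partly as English and partly as Dutch.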
The basic approach works by using several algorithms and having them ‘vote’ on the language:
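The talk didn’t spell out the individual algorithms, but the voting idea can be sketched with made-up detectors (the wordlists and the suffix heuristic below are toy assumptions, not the paper’s actual classifiers):

```python
from collections import Counter

def vote_language(token: str, detectors) -> str:
    """Each detector guesses a language; the majority guess wins."""
    votes = Counter(detect(token) for detect in detectors)
    return votes.most_common(1)[0][0]

# Toy detectors: a wordlist lookup and a crude suffix heuristic.
english_words = {"get", "set", "customer", "invoice"}
dutch_words = {"klant", "factuur", "bedrag"}

def wordlist_detector(token):
    if token in english_words:
        return "en"
    if token in dutch_words:
        return "nl"
    return "unknown"

def suffix_detector(token):
    return "nl" if token.endswith(("aa", "ij", "nt")) else "en"

vote_language("klant", [wordlist_detector, suffix_detector])  # 'nl'
```

The appeal of voting is that each individual detector can be weak on such short inputs, while the combined verdict is much more reliable.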
Using two different approaches (one for comments and one for identifiers), very high precision and recall were achieved (I missed the slide, but both above 90%).
So why would we care?
With this tool, we can now figure out whether this is a problem that actually occurs! (A nice way to use your research to prove its own usefulness.)
Timo studied whether many projects are multi-language, and also looked at the difference between industry and open source code.
Open source systems largely seem to be in English or unknown languages:
Comparing this to industry systems, the difference is really large:
Considerably more non-English identifiers and comments are found in industry projects.
Is this good or bad? Recruiting developers for a code base in a mixed natural language might be a challenge, but non-English identifiers might be closer to the domain. This is an open question.
A closing remark from Timo about researching code bases in general: taking industry systems into account is necessary, as they might be very different!
The paper is available as a preprint.