Mining Idioms from Source Code – Miltiadis Allamanis

Another paper by Miltiadis Allamanis! Amazing. Let’s see if it is as cool as this morning’s talk that I unfortunately did not live blog, but did summarize in one tweet:

So, what is a code idiom? An example is:

while (($(String) = $(BufferedReader).readLine()) != null) {
@BODY
}

(updated after comment below)

In other words, it is a high-level description of how to write code. Programmers like this! There are 49,000 hits on StackOverflow that refer to idiomatic code. Isn’t this just a clone or a pattern? No, idioms are syntactic constructions.
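To make the template at the top concrete, here is a small sketch of how the readLine idiom might be filled in. The class name, method name and input text are my own placeholders, not something from the talk or the paper.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: one possible instantiation of the mined idiom.
class ReadLineIdiom {
    // Collects every line from the reader using the while/readLine idiom.
    static List<String> readAll(BufferedReader reader) throws IOException {
        List<String> lines = new ArrayList<>();
        String line;                                  // fills the $(String) slot
        while ((line = reader.readLine()) != null) {  // reader fills the $(BufferedReader) slot
            lines.add(line);                          // stands in for @BODY
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader =
                 new BufferedReader(new StringReader("first\nsecond\nthird"))) {
            System.out.println(readAll(reader)); // prints [first, second, third]
        }
    }
}
```

The two metavariables and the @BODY placeholder are the only parts that change between uses; the surrounding syntax stays fixed, which is what makes it an idiom rather than a clone.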

The mining problem is now as follows: we have a corpus of ASTs and want to extract idioms from it. For this, Miltos uses probabilistic tree substitution grammars. It would be nice if we could construct them (as is done in natural language), but we will have to infer them from the data. Once we have them, they can be filtered, and then we have the idioms.
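To get a feel for what a tree fragment with open slots looks like, here is a toy sketch (Java 16 or later, because of the record). It does not attempt the inference step from the talk at all. It only shows a hand-written fragment being matched against a hand-written AST. The Node type, the "$" convention for slots and all node labels are my own assumptions for illustration.

```java
import java.util.List;

// Hypothetical toy model: matching a tree fragment with open slots against an AST.
class FragmentMatch {
    // Minimal AST node: a label plus ordered children.
    record Node(String label, List<Node> children) {
        static Node of(String label, Node... kids) {
            return new Node(label, List.of(kids));
        }
        // Assumption: labels starting with "$" mark slots that accept any subtree.
        boolean isSlot() {
            return label.startsWith("$");
        }
    }

    // True if the fragment matches the tree rooted at `tree`.
    static boolean matches(Node fragment, Node tree) {
        if (fragment.isSlot()) return true;
        if (!fragment.label().equals(tree.label())) return false;
        if (fragment.children().size() != tree.children().size()) return false;
        for (int i = 0; i < fragment.children().size(); i++) {
            if (!matches(fragment.children().get(i), tree.children().get(i))) return false;
        }
        return true;
    }

    // Counts the nodes of `tree` at which the fragment matches.
    static int countMatches(Node fragment, Node tree) {
        int count = matches(fragment, tree) ? 1 : 0;
        for (Node child : tree.children()) count += countMatches(fragment, child);
        return count;
    }

    public static void main(String[] args) {
        Node fragment = Node.of("While",
            Node.of("NotEquals",
                Node.of("Assign", Node.of("$String"),
                    Node.of("readLine", Node.of("$BufferedReader"))),
                Node.of("null")),
            Node.of("$BODY"));
        Node code = Node.of("Method",
            Node.of("While",
                Node.of("NotEquals",
                    Node.of("Assign", Node.of("line"),
                        Node.of("readLine", Node.of("reader"))),
                    Node.of("null")),
                Node.of("Block", Node.of("println"))));
        System.out.println(countMatches(fragment, code)); // prints 1
    }
}
```

In the real setting the fragments are not written by hand but inferred from the corpus, together with probabilities; the sketch only covers the "what does a fragment look like" part.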

Now, we still have to test them. Miltos’s approach is as follows:

[Photo of the slide outlining the approach]

A few of the mined idioms are things like: define a string constant, create a logger, loop through the lines of a buffer (as described above). There were also library-specific idioms, like creating a database transaction for Neo4j, getting an HTML document with jsoup, or showing a popup in Android.

To evaluate the quality of the idioms, the authors analyzed the following open source projects:

[Image: the evaluated open source projects]

On these, coverage and precision were calculated as follows: “We define idiom coverage as the percent of source code AST nodes that can be matched to the mined idioms. Coverage is thus a number between 0 and 1 indicating the extent to which the mined idioms exist in a piece of code. We define idiom set precision as the percentage of the mined idioms found in the test corpus.” (from the paper), with the following result.

[Image: coverage and precision results]
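The two definitions quoted from the paper boil down to two ratios. Below is a minimal sketch of how they might be computed, assuming the matching has already been done and we just have node counts and idiom identifiers. The class name, method names and the numbers in main are made up for illustration and are not results from the paper.

```java
import java.util.Set;

// Hypothetical helper: the two evaluation ratios as defined in the quote.
class IdiomMetrics {
    // Coverage: share of AST nodes in the test code matched by some mined idiom.
    static double coverage(int matchedNodes, int totalNodes) {
        if (totalNodes <= 0 || matchedNodes < 0 || matchedNodes > totalNodes) {
            throw new IllegalArgumentException("need 0 <= matchedNodes <= totalNodes, totalNodes > 0");
        }
        return (double) matchedNodes / totalNodes;
    }

    // Precision: share of mined idioms that occur in the test corpus.
    static double precision(Set<String> minedIdioms, Set<String> foundInTest) {
        if (minedIdioms.isEmpty()) {
            throw new IllegalArgumentException("no mined idioms");
        }
        long found = minedIdioms.stream().filter(foundInTest::contains).count();
        return (double) found / minedIdioms.size();
    }

    public static void main(String[] args) {
        // Made-up numbers, only to show the arithmetic.
        System.out.println(coverage(250, 1000)); // prints 0.25
        System.out.println(precision(
            Set.of("logger", "readLine", "constant", "popup"),
            Set.of("logger", "constant", "other"))); // prints 0.5
    }
}
```

Note that coverage is counted over AST nodes, not over files or posts, which matters for how the StackOverflow number below should be read.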

Miltos compared the extracted idioms with code examples on StackOverflow and they were quite similar: the mined idioms covered 31% of the AST nodes of the StackOverflow snippets (updated after comment below).

As another form of evaluation, Miltos compared his extracted idioms to the SnipMatch project. It turns out that 19 were already in it, so Miltos submitted the others: 5 were accepted, 4 were unsupported by the tool, 1 was rejected as a bad practice, and 15 are still waiting.

I really liked the idea, and especially the potential it offers to give feedback to developers while coding, such as issuing a warning when you are not conforming to an idiom commonly used in your project or language.

A preprint is available on arXiv.org.

2 Comments

  1. Miltos Allamanis

    Hi,

    Just to note the idiom on the top is:

    while (($(String) = $(BufferedReader).readLine()) != null) {
    @BODY
    }

    Apart from the projects that we used to mine idioms, we also used a set of files that use (import) some libraries (Fig 6 in the paper). This lets us find idioms both inside projects and across the usages of a library. These are the results that the top of Fig 9 refers to!

    Finally, for the evaluation using StackOverflow, coverage of 31% does not mean that we capture 31% of the posts, but 31% of the AST nodes of the StackOverflow snippets (I don’t know how this maps to StackOverflow posts). However, it is important to note that precision is more important than coverage, since it shows that we have a good set of snippets that commonly recur in code.

  2. Pingback: Mining Idioms from Source Code « Another Word For It
