Folding Repeated Instructions for Improving Token-Based Code Clone Detection

More SCAM talks! This session block (before the secret social event) is about cloning.

Hiroaki Murakami starts by explaining the problems with clone detection. One of them is overlapped code clones, as shown in this example.

This leads to false positives, since in this case 5 clone pairs are found instead of 1. Hiroaki’s aim is to detect only one pair in these situations. He does so by analyzing the structure of the code fragment. Repeated lines are compressed into one (in the example, the appending of “A” and “B” is compressed into a single line, and so is the appending of “C”, “D” and “E”).
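To make the folding idea a bit more concrete, here is a rough sketch in Python. It is not FRISC's actual algorithm, which the talk described only at a high level. It assumes that "repeated" means consecutive lines that are identical once string and number literals are replaced by placeholders. The function names `normalize` and `fold_repeats`, the `sb.append(...)` statements and the `$STR`/`$NUM` placeholders are all made up for illustration.

```python
import re


def normalize(line: str) -> str:
    """Replace string and integer literals with placeholders, so that
    statements differing only in literal values compare equal."""
    line = re.sub(r'"[^"]*"', "$STR", line.strip())
    line = re.sub(r"\b\d+\b", "$NUM", line)
    return line


def fold_repeats(lines):
    """Collapse each run of consecutive, identically normalized lines
    into a single line (a simplified stand-in for instruction folding)."""
    folded = []
    for line in lines:
        norm = normalize(line)
        # Only keep the line if it differs from the previous folded one.
        if not folded or folded[-1] != norm:
            folded.append(norm)
    return folded


# Hypothetical fragments modelled on the example from the talk:
# one appends "A" and "B", the other appends "C", "D" and "E".
fragment_1 = ['sb.append("A");', 'sb.append("B");', "return sb;"]
fragment_2 = ['sb.append("C");', 'sb.append("D");', 'sb.append("E");', "return sb;"]

folded_1 = fold_repeats(fragment_1)
folded_2 = fold_repeats(fragment_2)
```

After folding, both fragments reduce to the same two-line sequence, so a token-based detector comparing the folded forms would see one matching pair rather than several overlapping ones.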

The authors implemented the approach in a tool called FRISC and evaluated it on 8 open source systems. The results showed that precision went up 70%; recall, however, went down by 4%. So the folding comes at a price, although quite a small one.

The idea is interesting, also for my own recent work on clone detection in spreadsheets. We could apply similar methods and fold corresponding values. Need to give that some more thought!

Unfortunately, the paper is not available online yet.