Data Clone Detection and Visualization in Spreadsheets

We just got the news this paper is accepted at ICSE 2013 in San Francisco.

The idea

We know from our experience with spreadsheet users, that copy-pasting is a common way to transfer data within or between spreadsheets.However, this can be error-prone when the original data may changes and the copies get outdated. In software engineering, it is well-known that clones in source code can be harmful and therefore several algorithms have been created to detect them. Our idea is to adapt clone detection algorithms in such a way that they can detect copy-pasting within spreadsheets.

The algorithm

Based on existing text-based clone detection algorithms, we have developed an algorithm to detect data clones in spreadsheets: formulas whose values are copied as plain text in a different location. We do so by creating a lookup table of cells with the same value and subsequently clustering groups of cells with matching values.

Visualizing clones

We have integrated the clone detection into our existing data flow diagram visualization, such that relationships between worksheets that are caused by copy-pasting  are shown as dotted arrows.

In the following diagram, this is illustrated. You can see that data from CustomerData has been copied to the ProductTable. This is useful to know if you are updating the CustomerData, since you will have to make a new copy to ProductTable to prevent the two worksheets from getting out of sync.

Evaluation

To evaluate the usefulness of the proposed approach, we conducted two evaluations.

First, we analyzed a subset of the EUSES corpus, a well-known set of spreadsheets used by many researchers to test their algorithms. We found that, with the right parameters, our algorithm can achieve a precision of 82%.

Furthermore, we studied two different real-life cases. The first case study was conducted at the South-Dutch foodbank, where employees keep track of all logistics using spreadsheets. In 2011 the distrubution centre processed an average of 130.000 kilograms of food per month with those spreadsheets. The second case study we evaluated a spreadsheet used by our university to calculate the budget for a large (25 million Euro) research proposal. With the case studies we aim to determine whether detected data clones actually pose a threat to spreadsheet quality.

The results of the evaluation clearly indicate that data clones are common, data clones pose threats similar to those code clones pose and our approach supports users in finding and resolving data clones.

Want to know more? Feel free to comment below or read the preprint on figshare.

BibTex 

@inproceedings{Hermans122,
   author = {Felienne Hermans and Ben Sedee and Martin Pinzger and Arie van Deursen},
   title = {Data Clone Detection and Visualization in Spreadsheets},
   booktitle = {Proceedings of ICSE '13},
   year = {2013},
   note = {to appear}
}