November 20, 2012April 7, 2017Blog by felienne

Data Clone Detection and Visualization in Spreadsheets

We just got the news this paper is accepted at ICSE 2013 in San Francisco.

The idea

We know from our experience with spreadsheet users, that copy-pasting is a common way to transfer data within or between spreadsheets.However, this can be error-prone when the original data may changes and the copies get outdated. In software engineering, it is well-known that clones in source code can be harmful and therefore several algorithms have been created to detect them. Our idea is to adapt clone detection algorithms in such a way that they can detect copy-pasting within spreadsheets.

The algorithm

Based on existing text-based clone detection algorithms, we have developed an algorithm to detect data clones in spreadsheets: formulas whose values are copied as plain text in a different location. We do so by creating a lookup table of cells with the same value and subsequently clustering groups of cells with matching values.

Visualizing clones

We have integrated the clone detection into our existing data flow diagram visualization, such that relationships between worksheets that are caused by copy-pasting are shown as dotted arrows.

In the following diagram, this is illustrated. You can see that data from CustomerData has been copied to the ProductTable. This is useful to know if you are updating the CustomerData, since you will have to make a new copy to ProductTable to prevent the two worksheets from getting out of sync.

Evaluation

To evaluate the usefulness of the proposed approach, we conducted two evaluations.

First, we analyzed a subset of the EUSES corpus, a well-known set of spreadsheets used by many researchers to test their algorithms. We found that, with the right parameters, our algorithm can achieve a precision of 82%.

Furthermore, we studied two different real-life cases. The first case study was conducted at the South-Dutch foodbank, where employees keep track of all logistics using spreadsheets. In 2011 the distrubution centre processed an average of 130.000 kilograms of food per month with those spreadsheets. The second case study we evaluated a spreadsheet used by our university to calculate the budget for a large (25 million Euro) research proposal. With the case studies we aim to determine whether detected data clones actually pose a threat to spreadsheet quality.

The results of the evaluation clearly indicate that data clones are common, data clones pose threats similar to those code clones pose and our approach supports users in finding and resolving data clones.

Want to know more? Feel free to comment below or read the preprint on figshare.

BibTex

@inproceedings{Hermans122,
   author = {Felienne Hermans and Ben Sedee and Martin Pinzger and Arie van Deursen},
   title = {Data Clone Detection and Visualization in Spreadsheets},
   booktitle = {Proceedings of ICSE '13},
   year = {2013},
   note = {to appear}
}

felienne

January 10, 2024January 30, 2024Blog ongevraagd-advies

The case of Claudine Gay was not about plagiarism

After an American media frenzy of a few weeks, the new president of Harvard has resigned after the shortest term ever, only 6 months. This saga began with a US congressional hearing in December at which the presidents of drie major universities (all three women) were asked to say something about the increasing anti-Semitism at […]

October 7, 2023October 16, 2023Blog BNR

Teaching is the best job in the world!

This week’s unsolicited advice is about teachers. You can find all columns via Spotify, this edition can be found here: Listen to this column on BNR (Dutch)Listen to this column on Spotify (Dutch) Teachers! Last weekend I read a great piece in the NRC, editor Patricia Veldhuis spent the past year as a special intern […]

September 20, 2023October 16, 2023Blog BNR

Unbelievable AI

Time for a new Unsolicited Advice column this week! You can follow all Unsolicited Advice via Spotify, you can find this edition here: This column on BNR (Dutch)This column on Spotify (Dutch) An “incredible” video this week on Twitter/X: Wow!! With the app HeyGen you can effortlessly speak multiple languages! Alexander Klopping says something in […]

Data Clone Detection and Visualization in Spreadsheets

Data Clone Detection and Visualization in Spreadsheets

felienne

Visiting Address

Postal Adress

Recent posts

Data Clone Detection and Visualization in Spreadsheets

Share this:

felienne

Related Posts

Share this:

Share this:

Share this: