This paper concerns the analysis of data in spreadsheets, focusing on duplication of data.
So, how do I do this?
I visit DBLP (contains 2014228 papers) and enter spreadsheet data, this results in 44 results. I notice two very interesting paper that do not have to do with the related work section I am writing, but I want to read some other time, Unobtrusive data acquisition for spreadsheet research and Automated testing of databases and spreadsheets – the long and the short of it this one <meta> This how do stuff is great. I am promising you guys that I will read and blog about those papers, so please remind me of that in a week or two </meta>
This one does look related: An Investigation of the Incidence and Effect of Spreadsheet Errors Caused by the Hard Coding of Input Data Values into Formulas. Our paper does not concern hard-coded values in formulas, but hard-coded values in the cells themselves; still, it could be useful for understanding the role of values in formulas. Furthermore, fixed numbers are one of the checks we perform in the Infotron Analyzer, so knowing more about them could be useful.
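To give an idea of what such a check involves (the Infotron Analyzer's actual implementation is not shown here, so this is purely an illustrative sketch): the tricky part is flagging numeric literals in a formula without also flagging the row numbers inside cell references like A1 or $B$2.

```python
import re

# Hypothetical sketch of a 'fixed numbers' check: flag numeric literals
# that appear directly inside a formula string. The negative lookbehind
# excludes digits that follow a letter or '$', i.e. the row part of a
# cell reference such as A1 or $B$2.
NUMBER = re.compile(r'(?<![A-Za-z$])\b\d+(?:\.\d+)?\b')

def fixed_numbers(formula: str) -> list:
    """Return the numeric literals hard-coded in a formula string."""
    return NUMBER.findall(formula)

print(fixed_numbers("=A1*0.21+B2"))   # ['0.21'] — the 0.21 is hard-coded
print(fixed_numbers("=SUM(A1:A10)"))  # [] — only cell references
```

A real analyzer would of course parse the formula properly instead of using a regex, but this captures the intent of the check.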
And there is also this one: Implications of data quality for spreadsheet analysis. It does look related, so I am going to read it (and blog about it in the next post). This leads me to the idea of searching specifically for data quality in spreadsheets. Let's do that now. Unfortunately, there is only one other hit besides this paper; it is by Patrick O'Beirne, chairman of Eusprig. It seems related and feels like the best starting point right now.
But let's also try Google Scholar; it sometimes has different hits than DBLP. There are many papers on data quality, but only one that concerns spreadsheets: Quality control in spreadsheets: A software engineering-based approach to spreadsheet development. It is not directly related, but it is related to the things we are doing. <meta> I think I have read this paper before and possibly even cited it, but I don't know. Wasn't blogging then 🙂 </meta> Maybe they are also considering the data perspective.
Let's try that duplication angle now. I try 'data duplication detection' on Google Scholar and find Duplicate Record Detection: A Survey. This paper concerns finding duplicate data in databases, but the problem is the same, so there might be some useful methods in it. A paper titled Data duplication: An imbalance problem? looked interesting, but the abstract indicated it was very detailed, and I am looking for more general papers now. Just a fun statistic: this query results in 338,000 hits on Google Scholar and in 3 on DBLP. Quite the difference. There is one angle I have not yet tried: duplication in code, which is usually referred to as clones.
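The survey is about databases, but the core idea transfers directly to spreadsheet rows. A minimal sketch of the simplest case, exact matching after normalization (the survey mostly concerns the much harder approximate-matching case):

```python
from collections import defaultdict

# Illustrative sketch: find exact-duplicate rows in spreadsheet data.
# Rows are lists of cell values; we normalize each cell (trim whitespace,
# lowercase) and group row indices by the resulting key.

def find_duplicates(rows):
    """Return groups of row indices whose normalized content is identical."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(str(cell).strip().lower() for cell in row)
        groups[key].append(i)
    return [idxs for idxs in groups.values() if len(idxs) > 1]

rows = [
    ["Alice", "NL", 100],
    ["Bob",   "UK", 200],
    ["alice", "NL ", 100],   # duplicate of row 0 after normalization
]
print(find_duplicates(rows))  # [[0, 2]]
```

Approximate matching (typos, abbreviations, reordered fields) is where the methods from the survey would come in.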
'Code clones' results in 46 hits on DBLP. I have a look at the oldest paper, since it might have the best description of the problem: Removing Clones from the Code. Furthermore, this title looks interesting, as it is the only one on the list with 'quality' in it: Software Quality Analysis by Code Clones in Industrial Legacy Software.
Now it is probably time to go read some of those papers and check their references. I am starting with these:
Implications of data quality for spreadsheet analysis
Information and Data Quality in Spreadsheets
Software Quality Analysis by Code Clones in Industrial Legacy Software
Duplicate Record Detection: A Survey
Removing Clones from the Code
The whole search took me about three hours.