Spreadsheets at ICSE. Woo!
If you have used spreadsheets, you know how common it is to have multiple versions of the same file, for different months, projects or clients. But spreadsheets are rarely under version control, even though they stay alive for 5 years and are used by 13 users in that time. We all know that from my ICSE ’11 paper 🙂
If we had versions of spreadsheets, we could do cool things, like finding refactoring opportunities, like this:
Also, we can find inconsistencies between the versions:
Wensheng Dou made a new corpus to help other researchers do research on spreadsheet evolution. Really very cool.
He clustered files based on filename similarity, and then checked if the spreadsheets had a similar enough worksheet and table structure.
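To get a feel for the filename step, here is a minimal sketch of grouping files by name similarity. It is not the method from the paper: the `normalize` and `cluster_by_name` helpers, the 0.8 threshold and the example filenames are all my own assumptions, and the follow-up check on worksheet and table structure is left out.

```python
import re
from difflib import SequenceMatcher


def normalize(filename):
    """Lowercase, drop the extension, and replace digits and separators by spaces."""
    stem = filename.rsplit(".", 1)[0].lower()
    return re.sub(r"[\d\W_]+", " ", stem).strip()


def cluster_by_name(filenames, threshold=0.8):
    """Greedily group files whose normalized names are similar enough.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    clusters = []
    for name in filenames:
        for cluster in clusters:
            # Compare against the first file that started the cluster.
            ratio = SequenceMatcher(None, normalize(name), normalize(cluster[0])).ratio()
            if ratio >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([name])
    return clusters


# Hypothetical filenames, made up for this example.
files = [
    "budget_v1.xls",
    "budget_v2.xls",
    "gas_positions_0401.xls",
    "gas_positions_0402.xls",
    "trade_log.xls",
]
clusters = cluster_by_name(files)
```

With these made-up names, the two budget files and the two gas position files end up together, and `trade_log.xls` stays on its own.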
That results in nice clusters, but which version came first? For that he did a rather smart thing: he used the emails in the Enron set to create a history! He looked at the date, but also at the content of the emails. For example, when someone writes “here is an updated version of the spreadsheet”, you know which one came first.
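The date part of that idea can be sketched in a few lines. This is only an illustration under my own assumptions: the `order_versions` helper and the example attachments are made up, and the content cues (“here is an updated version”) are not modeled at all.

```python
from email.utils import parsedate_to_datetime


def order_versions(attachments):
    """Sort (filename, email Date header) pairs from oldest to newest email."""
    ordered = sorted(attachments, key=lambda pair: parsedate_to_datetime(pair[1]))
    return [filename for filename, _ in ordered]


# Hypothetical attachments with the Date header of the email that carried them.
attachments = [
    ("budget_v2.xls", "Mon, 5 Feb 2001 09:30:00 -0600"),
    ("budget_v1.xls", "Tue, 9 Jan 2001 14:00:00 -0600"),
]
history = order_versions(attachments)
```

Here `history` lists `budget_v1.xls` before `budget_v2.xls`, because its email was sent earlier.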
A few very interesting statistics:
- 78% of spreadsheet groups are maintained by more than 1 user
- Looking at the number of Excel errors (#DIV/0! etc.), we find that errors are introduced in 16.9% of groups
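One way to read “errors are introduced” is that a later version of a spreadsheet has more error cells than the version before it. The sketch below checks exactly that. The definition, the helper names and the cell values are my own assumptions, and the paper may well count this differently.

```python
# The standard Excel error values.
EXCEL_ERRORS = {"#DIV/0!", "#N/A", "#NAME?", "#NULL!", "#NUM!", "#REF!", "#VALUE!"}


def count_errors(cells):
    """Count the cells whose displayed value is an Excel error."""
    return sum(1 for value in cells if value in EXCEL_ERRORS)


def introduces_errors(versions):
    """True if any version has more error cells than the version before it.

    versions: cell values per version, ordered from oldest to newest.
    """
    counts = [count_errors(cells) for cells in versions]
    return any(later > earlier for earlier, later in zip(counts, counts[1:]))


# Hypothetical version group: the second version gains a #DIV/0! cell.
group = [
    ["10", "20", "0.5"],
    ["10", "0", "#DIV/0!"],
]
result = introduces_errors(group)
```

For this made-up group `result` is `True`, since the first version has no error cells and the second has one.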
From the paper, some really cool stats on changes:
Preprint is here: http://www.tcse.cn/~wsdou/papers/2016-icse-venron.pdf
The data is here: http://sccpu2.cse.ust.hk/venron/