VEnron: A Versioned Spreadsheet Corpus and Related Evolution Analysis — Wensheng Dou

Spreadsheets at ICSE. Woo!

If you have used spreadsheets, you know that it is common to have multiple versions of the same file, for different months, projects or clients. But spreadsheets aren’t commonly under version control, even though they stay alive for 5 years and are used by 13 users over that time. We all know that from my ICSE ’11 paper 🙂

If we have versions of spreadsheets, we could do cool things, like finding refactoring opportunities, like this:

[Image: 2016-05-19 15.07.53]

Also, we can find inconsistencies between the versions:

[Image: 2016-05-19 15.09.11]

Wensheng Dou made a new corpus to help other researchers do spreadsheet evolution research. Really very cool.

He clustered files based on filename similarity, and then checked whether the spreadsheets in a cluster had a similar enough worksheet and table structure.
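To make the first step concrete, here is a minimal sketch of clustering by filename similarity. It is my own illustration, not the method from the paper: the `normalize` rule (dropping digits and separators), the use of `difflib.SequenceMatcher`, the greedy grouping and the `0.8` threshold are all assumptions, and the example filenames are made up. The second step, comparing worksheet and table structure, is not shown.

```python
import os
import re
from difflib import SequenceMatcher


def normalize(filename):
    """Lowercase the name, drop the extension, and remove digits and separators."""
    stem = os.path.splitext(os.path.basename(filename))[0].lower()
    stem = re.sub(r"\d+", "", stem)  # assumption: digits are version or date noise
    return re.sub(r"[\s_\-.]+", " ", stem).strip()


def similarity(a, b):
    """Similarity of two filenames in [0, 1], computed on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_by_filename(filenames, threshold=0.8):
    """Greedy clustering: a file joins the first cluster whose first member is similar enough."""
    clusters = []
    for name in filenames:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical filenames, for illustration only.
files = [
    "budget_2001_01.xls",
    "budget_2001_02.xls",
    "Budget 2001 03.xls",
    "gas_positions.xls",
]
clusters = cluster_by_filename(files)
for cluster in clusters:
    print(cluster)
```

Comparing against only the first member of each cluster keeps the sketch short; a real pipeline would probably want a proper clustering algorithm and a tuned threshold.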

That results in nice clusters, but which version came first? For that he did a rather smart thing: he used the emails in the Enron set to create a history! He looked at the date, but also at the content of the emails. For example, when someone writes “here is an updated version of the spreadsheet”, you know which one came first.
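A rough sketch of the ordering idea, under my own assumptions: each attachment is a dict with hypothetical keys `file`, `sent` and `body`, versions are sorted by the email date, and a small regular expression flags emails whose text hints at an update. How the paper actually combines dates and content is not covered here, so this only sorts by date and reports the cue.

```python
import re
from datetime import datetime

# Assumption: a few phrases that suggest the attachment replaces an earlier one.
UPDATE_CUES = re.compile(r"\b(updated|revised|new|latest) version\b", re.IGNORECASE)


def order_versions(attachments):
    """Sort attachments by send date and flag emails that mention an update.

    Each attachment is a dict with the hypothetical keys
    'file' (str), 'sent' (datetime) and 'body' (str).
    """
    ordered = sorted(attachments, key=lambda a: a["sent"])
    return [(a["file"], bool(UPDATE_CUES.search(a["body"]))) for a in ordered]


# Made-up emails, for illustration only.
emails = [
    {
        "file": "plan_v2.xls",
        "sent": datetime(2001, 3, 5, 9, 30),
        "body": "Here is an updated version of the spreadsheet.",
    },
    {
        "file": "plan.xls",
        "sent": datetime(2001, 3, 1, 14, 0),
        "body": "Please find the plan attached.",
    },
]
history = order_versions(emails)
print(history)
```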

A few very interesting statistics:

  • 78% of spreadsheet groups are maintained by more than 1 user
  • Looking at the number of Excel errors (#DIV/0! etc.), we find that errors are introduced in 16.9% of groups
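To show what a statistic like the last one could look like in code, here is a small sketch. It assumes a definition that I made up for illustration: a group "introduces errors" if some version contains more Excel error values than the version before it. The paper's exact definition may differ, and the cell values below are invented.

```python
# Standard Excel error values as they appear in cells.
EXCEL_ERRORS = {"#DIV/0!", "#N/A", "#NAME?", "#NULL!", "#NUM!", "#REF!", "#VALUE!"}


def count_errors(cells):
    """Count the cell values that are Excel error values."""
    return sum(1 for value in cells if isinstance(value, str) and value in EXCEL_ERRORS)


def introduces_errors(versions):
    """True if any version has more error cells than its predecessor (assumed definition)."""
    counts = [count_errors(cells) for cells in versions]
    return any(later > earlier for earlier, later in zip(counts, counts[1:]))


def share_introducing_errors(groups):
    """Fraction of version groups in which errors are introduced."""
    if not groups:
        return 0.0
    return sum(introduces_errors(group) for group in groups) / len(groups)


# Each group is a list of versions, oldest first; each version is a flat list of cell values.
groups = [
    [[1, 2, 3], [1, "#DIV/0!", 3]],    # error appears in the second version
    [[10, 20], [10, 25]],              # no errors at all
]
share = share_introducing_errors(groups)
print(share)
```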

From the paper, some really cool stats on changes:


Preprint is here:
The data is here: