Finding Errors in the Enron Spreadsheet Corpus — Thomas Schmitz

Yeah, more people working on my things! After the VEnron paper at this year’s ICSE, there is now someone who dived into real errors in the dataset. We all know the problem: spreadsheets are error-prone (at least as error-prone as other forms of software), but how do we test approaches for finding those errors? The state of the art is to have either fake faults or fake spreadsheets. By that Thomas means either taking real-world spreadsheets and injecting faults into them, for example with Erwig’s work on mutation operators, or creating artificial spreadsheets that have faults, like spreadsheets created by students. These approaches are nice, but we would love to have real spreadsheets with real errors. This is where the Enron corpus comes into play.

Take this as an example: how do we know the formula is wrong? Thomas figured out it is really wrong, because he found a newer version and an email explaining the mistake.

The approach is as follows. First Thomas locates candidate spreadsheets, which are different versions of the same spreadsheet. Then the system visualizes the changes between those versions, to help researchers compare them. The goal is not to find errors automatically, but to help researchers find them, so they can be used in a benchmark.

When an error is alluded to in the emails, one can inspect the changes alongside the email and figure out what was wrong.

Every change between related spreadsheets could be a fix, but it could also be a change to a business rule. Therefore Thomas’s system makes the changes easy to read, showing them in the form: in cell I34 the formula SUM(I9:I33) was changed to SUM(I8:I33). From the change alone we do not know whether a row was added or a fault was fixed, but in this case a fault was fixed. To narrow down the set of candidates, smaller sets of changes are preferred, because big changes are more likely to be a complete overhaul than an error fix.
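To make the idea concrete, here is a minimal sketch of such a comparison. It is not Thomas’s tool. It assumes each spreadsheet version has already been read into a plain dictionary from cell address to formula text, and the names `diff_formulas`, `describe` and `rank_candidates` are made up for this illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: cell address -> formula text.
Sheet = Dict[str, str]
Change = Tuple[str, str, str]  # (cell, old formula, new formula)


def diff_formulas(old: Sheet, new: Sheet) -> List[Change]:
    """List every cell whose formula differs between two versions."""
    changes = []
    for cell in sorted(set(old) | set(new)):
        before = old.get(cell, "")
        after = new.get(cell, "")
        if before != after:
            changes.append((cell, before, after))
    return changes


def describe(change: Change) -> str:
    """Render one change in the readable form used in the talk."""
    cell, before, after = change
    return (f"in cell {cell} the formula {before or '(empty)'} "
            f"was changed to {after or '(empty)'}")


def rank_candidates(pairs: List[Tuple[str, Sheet, Sheet]]) -> List[Tuple[str, List[Change]]]:
    """Order version pairs by number of changes, fewest first.

    Pairs without any change are dropped, since they cannot contain a fix.
    """
    diffs = [(name, diff_formulas(old, new)) for name, old, new in pairs]
    return sorted((d for d in diffs if d[1]), key=lambda d: len(d[1]))


old_version = {"I34": "SUM(I9:I33)", "A1": "1"}
new_version = {"I34": "SUM(I8:I33)", "A1": "1"}
for change in diff_formulas(old_version, new_version):
    print(describe(change))
# prints: in cell I34 the formula SUM(I9:I33) was changed to SUM(I8:I33)
```

Sorting by the number of changed cells is only one simple way to prefer small change sets; a human still has to decide whether a given change is a fix or a new business rule.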

The dataset of the errors is available and, just to safeguard against Thomas’s graduation, it is on my site too: enron-errors. You can also download the tool yourself and go hunt for errors in the spreadsheets in your company 🙂
