A modern day Pompeii: Spreadsheets at Enron

Petrified victim at Pompeii

Real spreadsheets needed

From the day I got the idea of researching spreadsheets for my PhD dissertation, I needed access to ‘real life’ spreadsheets, the kind actually used within companies. I approached a number of companies and naively thought they would share a few spreadsheets with me to look at. Boy, was I wrong! By a pure stroke of luck, I ran into the head of the Excel team at Robeco at a meetup one day, and my problem was solved: I could freely play around with their spreadsheets for two years. But the community at large was still in need of a good test set, as I could not distribute the Robeco sheets any further. There is the EUSES set (paper, website), a set of about 5000 spreadsheets that researchers can use to test their algorithms on, but it has a few downsides:

  • It is obtained mainly by Googling, hence not necessarily representative of industrial spreadsheets
  • It is quite small compared to the modern data sets that CS researchers study
  • There is no “context”: we do not know who made the spreadsheets and for what purpose

Stop, subpoena time! 

When I talked about this problem with Emerson Murphy-Hill, he pointed me to the Enron Email Corpus. After Enron famously went bankrupt, the Federal Energy Regulatory Commission subpoenaed their emails to investigate, and this set was later acquired by Andrew McCallum, a computer scientist at the University of Massachusetts, and made available to the public. Because these emails were not voluntarily handed over, but gathered for evidence, they allow us a unique insight into a large organization's email behaviour, much like Vesuvius preserved Pompeii exactly as it was in the year 79. Once I obtained the data set, I extracted all spreadsheets from the emails. It turns out the set contains 265,586 attachment files, of which 51,572 are Excel files. Among those files I found 16,189 unique spreadsheets (based on MD5 file hash).
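For readers who want to repeat the de-duplication step on their own copy of the attachments, here is a minimal sketch of counting unique files by MD5 hash. It is not the script I used: the function names, the list of Excel extensions and the idea that the attachments sit in one directory tree are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Assumption: these extensions are what counts as "an Excel file".
EXCEL_SUFFIXES = {".xls", ".xlsx", ".xlsm"}


def is_excel(name):
    """True if the file name has one of the assumed Excel extensions."""
    return Path(name).suffix.lower() in EXCEL_SUFFIXES


def unique_by_md5(named_blobs):
    """Map MD5 digest -> name of the first Excel file seen with that content.

    `named_blobs` is any iterable of (file name, file bytes) pairs.
    """
    seen = {}
    for name, data in named_blobs:
        if is_excel(name):
            digest = hashlib.md5(data).hexdigest()
            seen.setdefault(digest, name)  # keep the first occurrence only
    return seen


def read_attachments(root):
    """Yield (path, bytes) for every file below `root` (hypothetical layout)."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            yield str(path), path.read_bytes()
```

Calling `len(unique_by_md5(read_attachments("attachments/")))` on a folder of extracted attachments would then give the number of distinct spreadsheets; the folder name is made up. Note that a file hash only catches byte-identical copies, so two versions of a sheet that differ in a single cell count as two spreadsheets.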

Many complex formulas, but small set of functions

One result of looking at the spreadsheets is that they are bigger than the existing EUSES corpus in many ways.


When inspecting the spreadsheets further, we found many interesting things (read the whole paper for all the juice), but something that really stood out is that there was so little diversity in the spreadsheet functions that were used. About 75% of all spreadsheets used only the top 15 functions, and in the entire set, only 134 functions were used, while Excel has over 300.
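To make the "top 15 functions" measurement a bit more concrete, here is a rough sketch of how one could compute such a number from formula strings. It is an illustration, not the analysis from the paper: the regular expressions are a simplification of Excel's formula grammar, functions are ranked by the number of spreadsheets they appear in (an assumption), spreadsheets without any function count as covered (also an assumption), and the input format (a dict from file name to a list of formula strings) is invented for the example.

```python
import re
from collections import Counter

# String literals are blanked out first so text like "MAX(" is not counted.
STRING_LITERAL = re.compile(r'"(?:[^"]|"")*"')
# Simplification: a function call is a name directly followed by "(".
FUNCTION_CALL = re.compile(r"\b([A-Za-z][A-Za-z0-9.]*)\(")


def functions_in_formula(formula):
    """Return the set of (upper-cased) function names used in one formula."""
    without_strings = STRING_LITERAL.sub('""', formula)
    return {name.upper() for name in FUNCTION_CALL.findall(without_strings)}


def functions_per_sheet(formulas_by_sheet):
    """Map each spreadsheet name to the set of functions it uses."""
    result = {}
    for sheet, formulas in formulas_by_sheet.items():
        used = set()
        for formula in formulas:
            used |= functions_in_formula(formula)
        result[sheet] = used
    return result


def function_usage(formulas_by_sheet):
    """Count in how many spreadsheets each function appears."""
    usage = Counter()
    for used in functions_per_sheet(formulas_by_sheet).values():
        usage.update(used)
    return usage


def share_using_only_top(formulas_by_sheet, n=15):
    """Fraction of spreadsheets whose functions all fall within the top n."""
    per_sheet = functions_per_sheet(formulas_by_sheet)
    if not per_sheet:
        return 0.0
    top = {name for name, _ in function_usage(formulas_by_sheet).most_common(n)}
    covered = sum(1 for used in per_sheet.values() if used <= top)
    return covered / len(per_sheet)
```

The number of distinct functions in the whole set is then simply `len(function_usage(sheets))`.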

100 spreadsheet emails a day

Looking into spreadsheet emailing behaviour, we found that emailing spreadsheets, or emailing about spreadsheets, was really common at Enron. Over the 15 months that the email set spans, we counted 100 emails per day (!) involving spreadsheets. Some emails occur twice in the set, as both the sender and the receiver were among the mailboxes acquired, so it would be fairer to say there were 100 spreadsheet email interactions a day. But still! Talking about errors in the spreadsheets was also pretty common: 6% of all spreadsheet-related emails contained words such as error or fault, like:

“This was the original problem around the pipe option spreadsheets which we discovered yesterday and the reason why the numbers did not match between the old and new processes.”

“The EOL deal will error out in Spreadsheet – Natural Gas, therefore you won’t see it erroring out under Sitara.”
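A keyword count like this is easy to try yourself. The sketch below is a guess at the general approach, not the code behind the 6%: the word list only contains the two words named above, so it is certainly shorter than the list used for the paper (the first quote, for example, talks about a "problem" and would not be caught), and the function names are made up.

```python
import re

# Assumption: only the two words mentioned in the post; the real list is longer.
ERROR_WORDS = ("error", "fault")

# Match the word and its inflections ("errors", "erroring", "faulty"),
# but not words that merely contain it, such as "default".
ERROR_PATTERN = re.compile(
    r"\b(?:" + "|".join(ERROR_WORDS) + r")\w*", re.IGNORECASE
)


def mentions_error(body):
    """True if the email body contains one of the error words."""
    return ERROR_PATTERN.search(body) is not None


def error_share(bodies):
    """Fraction of email bodies that mention an error word."""
    bodies = list(bodies)
    if not bodies:
        return 0.0
    return sum(1 for body in bodies if mentions_error(body)) / len(bodies)
```

Feeding `error_share` the bodies of all spreadsheet-related emails gives a fraction that can be compared with the number above, keeping in mind that the result depends heavily on which words are on the list.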

The dataset, online

The main reason why I put in the work of gathering the spreadsheets from the archive is to allow others to test their tools and algorithms on the data set as well. Therefore, I have uploaded the entire set of spreadsheets to my figshare page for everyone to explore and play with. Of course, anyone could extract them from the Enron set, but sharing them as a spreadsheet archive lowers the threshold to get started, so here you go. The whole set is under a Creative Commons license, so do what you like. I’d love to know what you’ll use it for though!

The full paper is available as a preprint on figshare.

