Detecting Code Smells in Spreadsheet Formulas

We just learned this paper is accepted at ICSM 2012.

The idea

The basic idea of this paper is to investigate whether we could apply the known code  smells invented by Martin Fowler in his book Refactoring to spreadsheet formulas. You could view a formula as a small piece of code, so it makes sense to assume this can ‘smell’ just as code can.

The smells

But, not all smells on his catalog are applicable to spreadsheets. Some, like Refused Bequest, have to do with inheritance, and that concept does not really exist in spreadsheets. And in a previous paper, we have already investigated smells between worksheets. This leaves us with the following smells that could occur in formulas:

  1. Multiple Operations Analogous to a long method, a formula with many different operations will likely be harder to understand than a shorter one. Especially since in most spreadsheet programs, there is limited space to view a formula, causing long formulas to be cut off.
  2. Multiple References Another code smell we use as a basis is the Many parameters code smell. A method that uses many input values might be split into multiple methods to improve readability. The formula equivalent of this smell occurs when a formula references many different other cells, like SUM(A1:A5; B7;C18;C19;F19)
  3. Conditional Complexity Fowler states that the nesting of many conditional operations should be considered a threat to code readability, since conditionals are hard to read. Since spreadsheet systems also allow for the use of conditional, spreadsheet formulas are at risk of this treat too.
  4. Long Calculation Chain Spreadsheet formulas can refer to each other, hence creating chains of calculation. To understand the meaning of such a formula, a spreadsheet user has to trace along multiple steps to find the origin of the data. This is a task that spreadsheet users find tedious.
  5. Duplicated Formula This smell indicates that similar snippets of code are used throughout a class. This is a concept common in spreadsheets too, where some formulas are partly the same as others.
Measuring the smells

Now we have the smells, we need to come up with a way to measure them, in order to detect what cells are smelly. We did that by analyzing a large body of spreadsheets (the EUSES corpus) and setting smelliness thresholds based on the values in that set. We used 70, 80 and 90% as thresholds, since these are values that are common in source code smells. In the evaluation we validated whether those thresholds made sense.

Evaluation

For the evaluation, first we analyzed the EUSES corpus a second time, to check the thresholds. We set them based on the formulas, and we wanted to know how many spreadsheets were smelly, given those thresholds. This is important, since we would not want to be in a situation where all, or no spreadsheets were smelly, that would mean we had to adjust the thresholds. Luckily, the results were promising, not too many smelly spreadsheet, but also not too few:

Then, we went onto the second evaluation, in which we analyzed 10 real life spreadsheet and interviewed their owners. In this evaluation we learned that there is little awareness of the risks of complicated formulas (like  Multiple Operations). Spreadsheet users seem not to be concerned too much about maintainability of formulas. They keep extending formulas with more operations and more references, causing formulas to become long and complicated.

Furthermore, we found two actual faults in a spreadsheet by looking at the Duplication Smell. With respect to the other smells, the concern caught is lack of understandability. Spreadsheet users found that our current
smell detection strategies reveal the formulas that are the least maintainable. These formulas will be time consuming to change, and changes made will be more error prone.

Conclusion

In this paper again we have found that principles from software engineering transfer well to the spreadsheet paradigm. The awareness of spreadsheet users for quality is very low, so automated analysis tools could really help spreadsheet users to find and improve complex formulas.

Want to know even more? Post a comment below or read the preprint (PDF).

BibTex

@inproceedings{Hermans122,
   author = {Felienne Hermans and Martin Pinzger and Arie van Deursen},
   title = {Detecting Code Smells in Spreadsheet Formulas},
   booktitle = {ICSM},
   year = {2012},
   crossref = {DBLP:conf/icsm/2012},
   bibsource = {DBLP, http://dblp.uni-trier.de},
   note = {to appear}}
@proceedings{DBLP:conf/icsm/2012,
   title = {IEEE 28th International Conference on Software Maintenance, ICSM 2012, Riva del Garda, Trento,Italy, September 23-30, 2012},
   booktitle = {ICSM},
   publisher = {IEEE},
   year = {2012},
   bibsource = {DBLP, http://dblp.uni-trier.de}}