VLOOKUP considered harmful

If you have worked with spreadsheets a bit, you must have encountered lookup functions. The most common one is the VLOOKUP, which can be used to couple two tables:

simple_VLOOKUP

The formula in D2 searches for the vale “3” in the range F2:G6, and return a value from the second column (indicated by the third parameter that is 2)

What not everyone knows it that you can use LOOKUPs in another way as well, to classify data. That looks as follows:

vlookup

Again the formula looks up a value, but now it returns an approximate match. 83 does not occur in the range E2:E7, C is returned as the highest option still below 83.

By default, Excel (and most other spreadsheet systems too) return approximate matches as above: when the precise value is not found, the user gets the closest value below the search value. You can override this behavior by adding FALSE as fourth parameter. When doing that, Excel will return #N/A instead of an approximate match. I explain it in more detail in the video below:

I found the default very interesting because I have seen a lot of spreadsheets over the last few years, and my experience said that the first use is much more common. Since we have the wonderful Enron dataset on our hands now, we can finally check it out. In this post I give a quick overview of the results, but if you want all the details, have a look at the full paper.

What lookup use is more common?

As suspected, the use of the exact setting is more common than using TRUE or omitting the final parameter.

For the VLOOKUP functions the numbers are 5,198 approximate versus 7,010 exact and for HLKOOP those numbers are 976 versus 1,047. So in total, of the total 14,231 lookup functions in the entire Enron set, 43% is approximate and 57% is exact.

Are the approximate lookups needed?

Although it is already remarkable that there are more formulas not using the default than doing so, I must admit I was still surprised. Almost half of the uses was approximate? This did not comply with my ideas of what people used the VLOOKUp for. So, I dived into the approximate match sheets deeper, and found that, in many of the cases, approximate was used, but it was not needed (i.e. the search value occurred in the range precisely)

An example.

not_needed

The values in the search range are 0..16, and these are exactly the values that are being sought, in the formulas in B24:R24 (searching for the values in B5:R5 which are exactly 0..16) Hence, the user is performing an exact match here, but using the approximate setting.

This type of use was common,  in a staggering 4,792 (77.6%) of all approximate matches, the approximate setting was technically not needed. There is a benefit in using the approximate match, it is quicker (as binary search is used) but for small spreadsheets the difference is negligible, and one could wonder if most spreadsheet users know this.

The most likely explanation here is that users used the default, without changing it.

Who cares if people use the default?

Well, the interesting thing here is that there are risks associated with using the approximate match. It uses binary search and therefore, it requires the search range to be sorted. If not, funky things can happen. So, as a final research question, we looked into formulas using approximate match over an unsorted search range. Turns out, 23% of all approximate lookups searches over an unsorted search range. Often this is smelly, but harmless, but in 11 of the cases we found actual errors.

An remarkably deceiving example:

Years

Because the postfix “ year” is added to values in the lookup range, Excel interprets all values as string. Therefore the column, while appearing sorted, is not. Column I shows wrong results are returned for 10, 15 and 30 year forecasts.

So, in short: VLOOKUPs default seems to be chosen in the wrong way, often people use the default setting while exact match would also do and using the default setting over an unsorted range can result in errors. Want to know more? Full paper is here:

Encore: The history of LOOKUP functions

Of course one would now wonder: why is the default for VLOOKUP chosen as it is? We have to go way back for that. Visical, widely considered to be the first spreadsheet program, already supported a lookup function among the 20 functions it supported. Interestingly enough, it did not support an IF formula, so the LOOKUP was likely used in those days to model decision tables:

In the first version of Excel, they supported the VLOOKUP function like the lookup function in VisiCalc, there was no parameter and everything was always approximate. This remained the case upto Excel 4.

vlookup

The people at the Excel team were so kind as to make a copy of the Excel 4 manual for me, in which we see VLOOKUP without its fourth parameter.

You can read the preprint for all the details, to be plublished at VL/HCC 2015, or have a look at the slides I presented there:

 

1 Comment

  1. Guido Smeets

    Did you also investigate whether the users were doing some pre-processing before doing an exact match, where instead no preprocessing combined with an approximate match would suffice?

    It wouldn’t surprise me if the users aren’t actually fully aware of the API’s abilities. 😉

Comments are closed.