Empirical Evaluation of Programming and Programming Language Constructs – Stefan Hanenberg

As you might know, I am a big fan of the work of Stefan, we had him over at Devnology last year, and I talked about his work at ALE and iTake.

In Computer Science, we often assume certain developer behavior, but, very often, we don’t know whether it actually occurs. A few examples:

[Photo of slide: example quotes from papers assuming developer behavior]

So all these authors refer to human behavior, and this behavior is essential to motivate an artefact (like OO, DSLs, or types), but they do not present any evidence for it. Maybe they are right, but we don’t really know, and there is a risk that they are wrong. And if the assumptions are wrong, some of our tools might be useless.

But that is not science!

No matter how intense a feeling of conviction it may be, it can never justify a statement. Thus I may be utterly convinced of the truth of a statement; certain of the evidence of my perceptions; overwhelmed by the intensity of my experience: every doubt may seem to me absurd. But does this afford the slightest reason for science to accept my statement? Can any statement be justified by the fact that K. R. P. is utterly convinced of its truth? The answer is, ‘No’; and any other answer would be incompatible with the idea of scientific objectivity.
Karl Popper, The Logic of Scientific Discovery

Different varieties of human-centered studies

There are different types of studies, for example, differences in what is being studied:

[Photo of slide: different types of studies, by what is being studied]

But there is another dimension on which studies differ, and that is what is being observed and measured.

Quantitative studies measure quantities: ‘this method is 3x more efficient than that one’.

Qualitative studies, on the other hand, collect data that cannot be directly mapped to quantities. These are typically interviews or think-aloud studies.

Quantitative studies

The focus of this briefing is quantitative studies: Stefan will teach us how to detect causal effects with hypothesis testing. This is a bit like studies in medicine: we give people a pill and then measure whether they get better.

The logic of these types of studies is like this:

((Hypothesis -> Observation) /\ !Observation) -> !Hypothesis

If you do not observe the predicted effects, you can conclude that the hypothesis is wrong.  Typically, this is done on a small sample in a controlled setting.

A typical experiment is structured like this: you measure the impact of an independent variable (e.g. the programming language) on a dependent variable (e.g. development time). Experiments typically suffer from confounding factors: factors you cannot control that might influence the results. Furthermore, there are ‘threats to validity’: flaws in the setup of the study that could impact the result, for example, the use of students instead of professional developers.

The good news is: you do not need to come up with a study design yourself; there are a lot of designs already out there, but only a few are applicable to SE. A few examples:

One Factor Design with Two Alternatives:

  • Independent variable with two treatments (for example for-loops versus iterator)
  • Subjects are randomly assigned to A or B

A related setup is the One Factor Design with N Alternatives, where you, obviously, use N different treatments.
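
To make the random assignment concrete, here is a minimal sketch in Java (the subject IDs and group sizes are made up, and it assumes java.util is imported):

// Randomly assign subjects to the two treatments by shuffling and splitting the list.
List<String> subjects = new ArrayList<>(List.of("s1", "s2", "s3", "s4", "s5", "s6"));
Collections.shuffle(subjects);
List<String> groupA = subjects.subList(0, 3);                 // treatment A: for-loop
List<String> groupB = subjects.subList(3, subjects.size());   // treatment B: iterator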

Multi-factor Designs:

  • The easiest is 2×2, which results in four groups
  • Two independent variables, two treatments each

An example: you want to know whether novices judge language constructs differently from experts. Then you have two factors: experience and language construct.
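
As a sketch, crossing the two factors gives the four groups of the design (the factor and level names here are just illustrative):

// Crossing the two factors yields the 2x2 = 4 groups of the design.
String[] experience = { "novice", "expert" };
String[] construct  = { "for-loop", "iterator" };
for (String e : experience)
   for (String c : construct)
      System.out.println("group: " + e + " x " + c);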

ABBA Crossover trial (AB within subject design)

  • Independent variable with two treatments (A and B)
  • Two groups, one does AB, the other BA

In this design, each subject gets both treatments: for example, one group does a task with for-loops and then with the iterator, and a second group does the same, but in the other order. This resolves the problem of unbalanced groups, but introduces a carry-over issue: if you have to do two similar tasks in a row, you might learn something from the first that is applicable to the second.
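
A small sketch of such a crossover schedule (the task labels are invented):

// ABBA crossover: both groups see both treatments, but in opposite order,
// so learning effects are spread evenly over the two treatments.
String[][] schedule = {
   { "task 1: for-loop (A)", "task 2: iterator (B)" },   // group 1: A then B
   { "task 1: iterator (B)", "task 2: for-loop (A)" }    // group 2: B then A
};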

Analysis techniques

All these different designs test the same thing: the alpha error. This is the probability of a type I error (a false positive). If the p-value is small enough, the results are considered significant.

So what can we learn if the results are not significant? This means that we have not measured a difference, and most experimenters then conclude that there was something wrong with the setup.

Different setups use different techniques for obtaining the p-value:
[Photo of slide: techniques for obtaining the p-value per study design]
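
As an illustration, for two independent groups a t-test is one way to obtain the p-value; here is a minimal sketch using Apache Commons Math (the library choice and the development times below are my own, not from the briefing):

import org.apache.commons.math3.stat.inference.TTest;

// Hypothetical task completion times (in minutes) for the two treatments.
double[] groupA = { 30.5, 28.0, 35.2, 31.1, 29.8 };   // for-loop
double[] groupB = { 40.1, 38.7, 36.9, 42.3, 39.0 };   // iterator

double p = new TTest().tTest(groupA, groupB);          // two-sided p-value
System.out.println(p < 0.05 ? "significant" : "not significant");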

Problems

There are many possible influencing factors, and there are a few specific to SE studies, for example the 10x developer.

Example: what is the impact of applying language feature X instead of feature Y? For example:

for(Person p: persons) {...}

versus

Iterator it = persons.iterator();
while (it.hasNext()) {
   Person p = (Person) it.next();
   ...
}

A few steps that you might want to take when studying something like this:

[Photo of slide: steps to take when studying this]

Small pilot studies are the way to go here! And these do not need to be real studies with a real sample; it can just be one person doing a task.

A few papers to look at:

How Do Programmers Use Optional Typing? An Empirical Study – Carlos Souza & Eduardo Figueiredo (PDF). A few interesting findings from this paper are that local variables were optionally typed very often, and that there were differences between kinds of methods (test methods, but also accessibility level).

What can we learn from a paper like this?

[Photo of slide: lessons we can learn from a paper like this]

An Empirical Investigation into Programming Language Syntax – Andreas Stefik and Susanna Siebert (no PDF found unfortunately)

These authors are designing a language themselves and were curious about the quality of the language. What were the results? Stefan will tell us in his next briefing on Thursday.