Corpus-based Empirical Software Engineering – Ekaterina Pek

The motivation for Kate’s work, she tells us, is the work of Knuth, who empirically studied punch cards with FORTRAN code in order to discover ‘what programmers really do’, as opposed to ‘what programmers should do’.

Kate has the same goal. She wants to measure how languages are used:

  • frequency counts -> How often are parts of the language used?
  • coverage -> What parts of the language are used?
  • footprint -> How much of each language part is used?
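To make the three measures concrete, here is a minimal sketch in Python. It is not Kate's tooling: the construct names, the file names and the reading of ‘footprint’ as the share of all occurrences taken up by each construct are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical language definition: the constructs the language offers.
LANGUAGE_CONSTRUCTS = {"if", "loop", "goto", "call", "assign"}

# Hypothetical corpus: each program reduced to the constructs it uses.
corpus = {
    "a.src": ["assign", "if", "assign", "call"],
    "b.src": ["assign", "loop", "assign"],
}


def frequency_counts(corpus):
    """How often is each construct used across the whole corpus?"""
    counts = Counter()
    for constructs in corpus.values():
        counts.update(constructs)
    return counts


def coverage(corpus, language):
    """Which constructs of the language are used at all?"""
    used = set()
    for constructs in corpus.values():
        used.update(constructs)
    return used & language


def footprint(corpus):
    """Share of all occurrences taken up by each construct (assumed reading)."""
    counts = frequency_counts(corpus)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}
```

In this toy corpus, `goto` is part of the language but never shows up in the coverage set, which is exactly the kind of gap between ‘what programmers should do’ and ‘what programmers really do’ that such measurements expose.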

In order to be able to perform such analyses, we need a ‘corpus’: a big set of language data to work on. Knuth even collected punch cards from garbage bins, because it was so important for him to get more data.

And it is not just code she looked at: libraries, bugs, emails and commits are taken into account as well. Some of these have to be sanitized before they are usable in a corpus.


P3P is a domain-specific language for privacy definitions. Kate wanted to know how it is used and whether it is used correctly. For this, she crawled 4 million websites.

An interesting finding in this paper is that Kate found a great many clones. Not just in syntax and semantics, but also in the textual headers, which were supposed to serve as descriptions (so cloning them does not make sense). So maybe a full language is not needed, and a few choices might suffice.
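As a rough illustration of how textual clones can be spotted in a crawled corpus, the sketch below groups documents whose text is identical after whitespace and case normalization. This is an assumed, simplified technique (exact matching on a hash), not the clone detection used in the paper, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    """Collapse whitespace and lowercase, so trivial edits do not hide a clone."""
    return " ".join(text.split()).lower()


def clone_groups(documents):
    """Group document names whose normalized text is identical.

    `documents` maps a name (e.g. a site) to the text of its policy.
    Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for name, text in documents.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups[digest].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Syntactic or semantic clones would need a comparison on the parsed policy rather than on raw text. The hash-based grouping only catches the copy-and-paste case.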

Corpus engineering

To understand the need for corpora, Kate did a literature survey, in which she looked into how often corpora are currently used, and how. For this, Kate used a grounded theory approach in which she coded papers and let a schema emerge. Papers from CSMR, ESEM, ICPC, ICSM, MSR, SCAM and WCRE were analyzed. The schema covered:

  • Corpora
  • Form
  • Tools
  • Self-classifications
  • Structural signs of quality
  • Reproducibility
  • Assessment

Results: 94% of the papers use a corpus and 83% use product-based corpora. On average, a corpus consists of 3 Java projects. The most popular project appears in only 8% of the papers (so there is a lot of variety).

Some more results:


Corpus reengineering

So, in conclusion, corpora are very important to SE research. Kate’s work improves reproducibility and comparability and reduces effort, by separating the effort of making the corpus from the effort of the analysis.

The main message is that research-driven research is what the SE community is missing. We should inspect existing practices more, learn from what is already done and improve upon that.