Inferring Method Specifications from Natural Language API Descriptions

The final session of ICSE 2012 is here. It is always hard to keep people awake and interested at this point, but luckily the topic is really cool. Generating code from natural language has been a topic of interest for a long time (see Stephen Wolfram's blog post on Wolfram Alpha).

The challenges of extracting specifications from natural language are, according to the authors, that language is often incomplete and imprecise. No surprises so far.

The paper's approach consists of several steps, each aimed at improving precision.

  1. First, sentences are chopped up (“Password should be longer than 6 characters and contain a number” is transformed into “Password should be longer than 6 characters” and “Password should contain a number”)
  2. Next, the language is made more precise (for instance, by replacing “and/or” with “or”)
  3. Then, a text analysis algorithm turns words into categories: subjects, verbs and clauses
  4. Subsequently, a graph is made, representing the relationships between the subjects, verbs and clauses
  5. A postprocessor then refines this tree, for instance turning ‘is no longer’ into two different objects: negation and equivalence
  6. From the refined tree, specifications are created by mapping the tree onto a template from a set of templates. Examples of these templates are ‘subject should not be null’ or ‘subject is <list of clauses>’
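To make these steps a bit more concrete, here is a toy sketch in Python. It is not the authors' implementation: it uses plain regular expressions instead of real text analysis, it only understands sentences of the form "X should ...", and all the names in it (`Clause`, `normalize`, `split_conjunction`, `parse_clause`, `to_spec`) are made up for this illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Clause:
    """Hypothetical stand-in for one node of the refined tree."""
    subject: str
    negated: bool
    predicate: str


def normalize(sentence: str) -> str:
    """Step 2 (toy version): replace the imprecise 'and/or' by 'or'."""
    return re.sub(r"\band/or\b", "or", sentence)


def split_conjunction(sentence: str) -> List[str]:
    """Step 1 (toy version): 'S should A and B' -> ['S should A', 'S should B']."""
    match = re.match(r"^(?P<subject>\w+) should (?P<rest>.+)$",
                     sentence.strip().rstrip("."))
    if match is None:
        # Not in the one shape this sketch understands: leave it alone.
        return [sentence]
    subject = match.group("subject")
    parts = re.split(r"\s+and\s+", match.group("rest"))
    return [f"{subject} should {part}" for part in parts]


def parse_clause(sentence: str) -> Clause:
    """Steps 3-5 (toy version): pull out subject, negation and predicate."""
    match = re.match(r"^(\w+) should (not )?(.+)$", sentence)
    if match is None:
        raise ValueError(f"unsupported sentence: {sentence!r}")
    return Clause(subject=match.group(1),
                  negated=match.group(2) is not None,
                  predicate=match.group(3))


def to_spec(clause: Clause) -> str:
    """Step 6 (toy version): map a clause onto a template."""
    if clause.predicate == "be null":
        # Template: 'subject should (not) be null'
        operator = "!=" if clause.negated else "=="
        return f"{clause.subject} {operator} null"
    # Fallback template: keep the predicate as an opaque condition.
    prefix = "not " if clause.negated else ""
    return f"{prefix}({clause.subject} {clause.predicate})"


if __name__ == "__main__":
    text = "Password should be longer than 6 characters and contain a number"
    for part in split_conjunction(normalize(text)):
        print(to_spec(parse_clause(part)))
    print(to_spec(parse_clause("name should not be null")))
```

The real approach obviously needs far more than regular expressions, but the shape is the same: simplify the sentence first, then extract a small structure, then fill in a template.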

The authors performed an evaluation in which they compared their specifications to specs written by humans. Their method reaches 92% precision and 93% recall in identifying specification sentences among more than 2500 sentences of API documents, and 83% accuracy in the inferred specifications. Pretty impressive and interesting work.
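As a reminder of what precision and recall measure here, a minimal sketch in Python. The counts in the example are invented purely to show the arithmetic; they are not the paper's data, and the function names are my own.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of the sentences flagged by the tool that really are specifications."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of the real specification sentences that the tool found."""
    return true_positives / (true_positives + false_negatives)


if __name__ == "__main__":
    # Invented counts, for illustration only.
    tp, fp, fn = 90, 10, 30
    print(f"precision = {precision(tp, fp):.2f}")  # 90 / 100
    print(f"recall    = {recall(tp, fn):.2f}")     # 90 / 120
```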