Leveraging Natural Language Analysis of Software: Achievements, Challenges, and Opportunities

Second ICSM keynote by Lori Pollock. She starts out by saying software is like a car in many ways:

  • It can break
  • We want new features
  • We want it to go faster
  • It gets more and more complicated under the hood

Thus: we need ‘power tools’ to fix and upgrade our software. In software, the power comes from some kind of automation over some kind of artifact. It can operate over source code (as most of our tools do), but also over documentation, email, test suites, and code changes.

Examples of such power tools are code search tools or method summarization tools. To create this power, several types of information can be used: control flow graphs, call graphs, program dependence graphs. Most analysis is done statically; sometimes runtime information is used as well. Now that code repositories are available, we can also use bug reports and version information.

But there is more!  We could apply natural language analysis to source code. Natural language occurs in several places in code: method names, parameter names, comments, class names. So not only does a word have a meaning, it also has a role. This can be leveraged for (obviously) code search, but also for traceability between requirements and code.

A concrete example of applying natural language processing to source code is Dora, a tool for code exploration. Dora takes a natural language query as input and returns the places in the code related to that query: the relevant neighborhood. It first performs structural analysis on the class structure and then prunes those results using natural language analysis.
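The two-phase idea (structural candidates, pruned by natural language relevance) can be sketched roughly as follows. This is not Dora's actual algorithm; the function names, the term-overlap score, and the threshold are all my own illustrative assumptions.

```python
def nl_relevance(query_terms, identifier_terms):
    """Fraction of query terms that also occur among an identifier's terms.

    A crude stand-in for whatever lexical similarity a real tool would use.
    """
    query_terms = set(query_terms)
    if not query_terms:
        return 0.0
    return len(query_terms & set(identifier_terms)) / len(query_terms)


def prune_neighborhood(candidates, query_terms, threshold=0.5):
    """Keep only structurally reachable methods whose names match the query well.

    `candidates` is a list of (method_name, terms) pairs, assumed to come
    from a prior structural analysis (e.g. call-graph neighbors).
    """
    return [name for name, terms in candidates
            if nl_relevance(query_terms, terms) >= threshold]


# Hypothetical usage: structural analysis found two neighbors of a seed method.
neighbors = [("saveFile", ["save", "file"]), ("openSocket", ["open", "socket"])]
print(prune_neighborhood(neighbors, ["save", "file"]))
```

The point of the sketch is only the pipeline shape: structure proposes, natural language disposes.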

Lori now takes a step back and explains text analysis in more detail. There are two different flavors:

  • Information/text retrieval (I/TR): for a query, return as many relevant documents as possible
  • Natural language processing (NLP): create software that analyzes and understands natural language

Let’s look at comments in code. They already come in different types: notes, descriptions, explanations, cross-references, and many more. Comment text differs from ordinary natural language: not all comments are full sentences (often they start with a verb), some lack punctuation, and sometimes they contain specific commands such as Javadoc tags. So if you are going to perform NLP on comments, you need to be aware of these subtle differences.

The same holds for identifiers: there are specific ways in which they are built. Often they consist of multiple words, with camelCase or underscores used to split them, but not always; we also use abbreviations, and those can have different meanings in different contexts. With this information, we can guess the meaning: SortList will most likely sort a list, whereas SortedList will return whether a list is actually sorted.
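A minimal identifier splitter along these lines might look like the sketch below. It handles underscores, camelCase, and acronym runs, but, as the talk notes later, real identifiers defy such simple rules often enough that this is far from a solved problem.

```python
import re

def split_identifier(name):
    """Split a camelCase / snake_case identifier into lowercase terms.

    Handles acronym runs like "XMLParser" -> ["xml", "parser"], but will
    fail on identifiers with no casing or separator cues at all.
    """
    parts = []
    for chunk in name.split('_'):
        # acronym followed by a word | capitalized or plain word | acronym | digits
        parts += re.findall(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+', chunk)
    return [p.lower() for p in parts]


print(split_identifier("SortedList"))     # camelCase
print(split_identifier("parse_XML_file")) # underscores plus an acronym
```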

Cool things we can do with NLP in SE include, for instance, lexicalization. Suppose you have a statement print(current); to understand it, you need to know the context of current. Here, type information is useful: current is of type Document, and with this information we can generate the phrase print current document, which carries more information. But it can be more complicated. If the variable name is item and the type is Selectable, the resulting phrase should be selected item. So generating the phrase also has to take part of speech into account.
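The simple case (variable plus type name) can be sketched as below. The function name and the combination rule are my own assumptions; note that the harder Selectable-to-"selected" case requires morphological analysis that this sketch does not attempt.

```python
def lexicalize(call_verb, var_name, type_name):
    """Naive lexicalization sketch: expand `verb(var)` into an English-like phrase
    using the variable's declared type as extra context."""
    if var_name.lower() == type_name.lower():
        # The variable already names the concept; the type adds nothing.
        return f"{call_verb} {var_name.lower()}"
    return f"{call_verb} {var_name.lower()} {type_name.lower()}"


# Hypothetical usage for the talk's example, print(current) with current: Document
print(lexicalize("print", "current", "Document"))
```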

Problems that still exist include, for instance, splitting multiwords. Sometimes camelCase is used, but not everywhere. Dictionaries can be useful, but the problem is not yet solved: none of the current approaches has reached high precision.

The same holds for expanding abbreviations; this too is an open problem. One strategy currently applied is to look at nearby text (comments or other artifacts) to see if there is a word that might be the expansion.
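That nearby-text strategy can be approximated with a subsequence match: a candidate expansion should start with the abbreviation's first letter and contain its letters in order. This is a toy heuristic of my own, not a description of any published expansion technique.

```python
def is_subsequence(abbrev, word):
    """True if the letters of `abbrev` occur in `word` in order."""
    it = iter(word)
    return all(ch in it for ch in abbrev)


def expand_abbreviation(abbrev, nearby_text):
    """Guess an expansion for `abbrev` from surrounding comment text.

    Returns the first longer word that starts with the same letter and
    contains the abbreviation as a subsequence, or None if nothing matches.
    """
    abbrev = abbrev.lower()
    for word in nearby_text.lower().split():
        word = word.strip('.,;:()"')
        if (len(word) > len(abbrev)
                and word[0] == abbrev[0]
                and is_subsequence(abbrev, word)):
            return word
    return None


print(expand_abbreviation("msg", "send the message to the server"))
```

A real tool would of course have to rank multiple candidates and handle abbreviations whose expansion never appears nearby.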

Also, part-of-speech tagging is hard. Although this is more or less solved for ordinary text with the Stanford parser, it is not so easy on method names, since their format again differs from that of full sentences.
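One reason generic taggers struggle is that method names follow their own grammar: the leading term is usually a verb, even when it would be tagged as a noun or adjective in running text. A crude domain-specific tagger built on that observation might look like this; the verb list and tag set are purely illustrative.

```python
# Tiny, assumed verb lexicon; a real tool would use a much larger dictionary.
COMMON_VERBS = {"get", "set", "is", "has", "add", "remove",
                "create", "parse", "print", "sort"}

def tag_method_name(terms):
    """Heuristically tag the (already split) terms of a method name.

    Exploits the convention that method names lead with a verb;
    everything else defaults to NOUN.
    """
    tags = []
    for i, term in enumerate(terms):
        if i == 0 and term in COMMON_VERBS:
            tags.append((term, "VERB"))
        else:
            tags.append((term, "NOUN"))
    return tags


print(tag_method_name(["get", "user", "name"]))
```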

So, to conclude, there are many opportunities to improve NLP so that it becomes more applicable to SE artifacts.

2 Comments

  1. Adrian

    I was very much looking forward to hearing about this keynote and must admit that I am kind of disappointed. This seems like a very narrow view on NLP application for software engineering, and the mentioned future work very much hints at an outdated rule-based view of NLP rather than using statistical analysis. Splitting words and expanding abbreviations are petty technical problems; the real open question in NLP for software engineering is how to unleash the full power of statistics, as hinted at in ICSE 2012’s amazing “Naturalness of Software” paper …

    1. Felienne (Post author)

      I heard similar sentiments from people more into the NLP field. For me, as an outsider, it was a really gentle and interesting introduction.

      But I will definitely check out that ICSE paper. For others interested, it can be found online http://macbeth.cs.ucdavis.edu/natural.pdf
