Developers communicate through a lot of free text, such as emails, documentation, and bug reports, parts of which are source code (snippets, patches, examples). You want to separate the source code from the natural language, in order to perform a more detailed analysis on each.
Previous approaches to this relied on regular expressions, island parsing, or a combination of the two. But writing the right regular expression or building a parser is hard.
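To get a feel for why hand-written patterns are hard to get right, here is a minimal sketch of a line-based regular expression heuristic. The two patterns and the function name `looks_like_code` are made up for illustration; they are not taken from any of the earlier tools.

```python
import re

# Hypothetical patterns: a line "looks like code" if it starts with a
# common keyword or ends with a statement/block terminator.
KEYWORD_START = re.compile(r"^\s*(public|private|static|return|import|for|while)\b")
TERMINATOR_END = re.compile(r"[;{}]\s*$")


def looks_like_code(line):
    """Return True if either hand-written pattern matches the line."""
    return bool(KEYWORD_START.search(line) or TERMINATOR_END.search(line))


if __name__ == "__main__":
    for line in [
        "int x = 0;",                          # code, caught by the terminator
        "The parser crashes on empty input.",  # prose, correctly rejected
        "for example, this fails",             # prose, wrongly flagged as code
    ]:
        print(looks_like_code(line), line)
```

The third line is ordinary English, yet it starts with `for`, so the heuristic flags it. Every such case needs another special rule, which is where the effort goes.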
Massimiliano presents something new: learning from examples. The idea is based on Markov models.
A Markov model is like a state machine, but with probabilities on the transitions. More specifically, the approach is based on a hidden Markov model, a setting in which the states are not directly visible and we try to reconstruct them from observations.
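The "state machine with probabilities" picture can be sketched in a few lines. The two states and all the numbers below are invented for illustration only; nothing here comes from the talk.

```python
# Hypothetical two-state Markov model. Each row gives the probability of
# moving from the current state to each possible next state; rows sum to 1.
TRANSITIONS = {
    "TEXT": {"TEXT": 0.9, "CODE": 0.1},
    "CODE": {"TEXT": 0.2, "CODE": 0.8},
}


def sequence_probability(states, start="TEXT"):
    """Probability of visiting `states` in order, beginning from `start`."""
    prob = 1.0
    current = start
    for nxt in states:
        prob *= TRANSITIONS[current][nxt]
        current = nxt
    return prob


if __name__ == "__main__":
    # TEXT -> TEXT -> CODE -> CODE
    print(sequence_probability(["TEXT", "CODE", "CODE"]))
```

In a hidden Markov model the state sequence is exactly the part we do not get to see: we only see what each state emits, and have to work backwards to the most plausible states.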
The trick is to treat the text as a sequence of observable tokens. Their approach proves to work pretty well, especially if the algorithm was first trained on source code with conventions similar to those of the system under test.
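As a rough illustration of the idea, and not the implementation presented in the talk, the sketch below trains a tiny hidden Markov model by counting over labelled examples and then uses the Viterbi algorithm to label new tokens. Everything specific is an assumption: the two hidden states `TEXT` and `CODE`, the coarse token classes, the add-one smoothing, and the toy training data.

```python
import math
import re
from collections import Counter

STATES = ("TEXT", "CODE")                               # assumed hidden states
SYMBOLS = ("word", "number", "identifier", "symbol")    # assumed observations


def tokenize(text):
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+|\S", text)


def token_class(tok):
    """Map a raw token to a coarse observable symbol (hypothetical classes)."""
    if re.fullmatch(r"[A-Za-z]+", tok):
        return "word"
    if re.fullmatch(r"[0-9]+", tok):
        return "number"
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", tok):
        return "identifier"  # name containing a digit or underscore
    return "symbol"


def log_table(counts, rows, cols):
    """Turn counts into log-probabilities per row, with add-one smoothing."""
    table = {}
    for r in rows:
        total = sum(counts[r, c] for c in cols) + len(cols)
        for c in cols:
            table[r, c] = math.log((counts[r, c] + 1) / total)
    return table


def train(labelled):
    """Estimate start, transition and emission tables from (token, state) sequences."""
    start, trans, emit = Counter(), Counter(), Counter()
    for seq in labelled:
        prev = None
        for tok, state in seq:
            emit[state, token_class(tok)] += 1
            if prev is None:
                start[state] += 1
            else:
                trans[prev, state] += 1
            prev = state
    n = sum(start.values()) + len(STATES)
    log_start = {s: math.log((start[s] + 1) / n) for s in STATES}
    return log_start, log_table(trans, STATES, STATES), log_table(emit, STATES, SYMBOLS)


def viterbi(tokens, model):
    """Return the most likely hidden state for each token."""
    if not tokens:
        return []
    start, trans, emit = model
    obs = [token_class(t) for t in tokens]
    score = {s: start[s] + emit[s, obs[0]] for s in STATES}
    back = []
    for o in obs[1:]:
        prev_score, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev_score[p] + trans[p, s])
            score[s] = prev_score[best] + trans[best, s] + emit[s, o]
            pointers[s] = best
        back.append(pointers)
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):  # follow back-pointers from the last token
        state = pointers[state]
        path.append(state)
    return path[::-1]


def label(text, state):
    return [(tok, state) for tok in tokenize(text)]


# Toy training data, invented for this sketch.
TRAINING = [
    label("the parser crashes when the input is empty", "TEXT")
    + label("if (input == null) { return 0; }", "CODE")
    + label("please have a look at this", "TEXT"),
    label("int count = list.size();", "CODE")
    + label("that line throws an exception", "TEXT"),
]
MODEL = train(TRAINING)

if __name__ == "__main__":
    tokens = tokenize("this line fails if (x == 0) { return; }")
    for tok, state in zip(tokens, viterbi(tokens, MODEL)):
        print(state, tok)
```

With this toy model the opening words come out as `TEXT` and the trailing punctuation as `CODE`. Where exactly the boundary falls depends entirely on the training examples, which fits the observation that training on code with similar conventions matters.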