Massimiliano Di Penta – A Hidden Markov Model to Detect Coded Information Islands in Free Text

Developers communicate through a lot of free text, such as emails, documentation and bug reports, and parts of that text are source code (snippets, patches, examples). You want to separate the source code from the natural language, in order to perform more detailed analysis on both.

Previous approaches to this have used regular expressions, island parsing or a combination of the two. But writing the right regular expression or building a parser is hard.

Massimiliano presents something new: learning from examples. The idea is based on Markov models.

A Markov model is like a state machine, but with probabilities on the transitions. More specifically, the approach is based on a hidden Markov model, a situation where the states cannot be observed directly and we try to reconstruct the Markov model from observations.
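To make the idea a bit more concrete, here is a minimal sketch of decoding with a hidden Markov model, using the standard Viterbi algorithm. Everything in it is an assumption for illustration: the two hidden states ("text" and "code"), the two observation symbols ("word" and "symbol") and all probabilities are made up by hand and are not taken from the paper.

```python
import math

STATES = ("text", "code")

# Hypothetical, hand-picked probabilities (not from the paper).
START = {"text": 0.8, "code": 0.2}
TRANS = {
    "text": {"text": 0.9, "code": 0.1},
    "code": {"text": 0.1, "code": 0.9},
}
EMIT = {
    "text": {"word": 0.85, "symbol": 0.15},
    "code": {"word": 0.40, "symbol": 0.60},
}


def viterbi(observations):
    """Return the most likely sequence of hidden states for the observations."""
    if not observations:
        return []
    # best[s] is the log-probability of the best path that ends in state s.
    best = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    back = []  # one dict of back-pointers per step after the first
    for obs in observations[1:]:
        prev = best
        best = {}
        pointers = {}
        for s in STATES:
            # Pick the predecessor state that maximizes the path probability.
            p = max(STATES, key=lambda q: prev[q] + math.log(TRANS[q][s]))
            pointers[s] = p
            best[s] = prev[p] + math.log(TRANS[p][s]) + math.log(EMIT[s][obs])
        back.append(pointers)
    # Walk the back-pointers from the best final state to recover the path.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi(["word", "word", "symbol", "symbol", "symbol", "word"]))
```

In a learned model the three probability tables would be estimated from labeled examples instead of being written by hand; the decoding step stays the same.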

The trick is to treat the text as a sequence of observable tokens. Their approach proves to work pretty well, especially if the algorithm was first trained on source code with conventions similar to those of the system under test.
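As a rough illustration of what "text as observable tokens" could look like, the sketch below splits a string into tokens and maps each one to a coarse class. The function name `to_observations`, the regular expression and the three classes are my own assumptions; the talk summary does not say which token alphabet the paper actually uses.

```python
import re

# Hypothetical tokenizer: runs of letters/underscores, runs of digits,
# or any other single non-whitespace character.
TOKEN_RE = re.compile(r"[A-Za-z_]+|\d+|\S")


def to_observations(text):
    """Map free text to a list of (token, token_class) pairs."""
    observations = []
    for token in TOKEN_RE.findall(text):
        if token[0].isalpha() or token[0] == "_":
            token_class = "word"
        elif token.isdigit():
            token_class = "number"
        else:
            token_class = "symbol"
        observations.append((token, token_class))
    return observations


if __name__ == "__main__":
    for token, token_class in to_observations("Try x = foo(1); instead."):
        print(token, token_class)
```

The sequence of token classes, rather than the raw characters, is what a sequence model would then see as its observations.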

The paper is on ResearchGate, but not with a PDF attached (yet?). Update: now including a PDF 🙂


  1. Max Di Penta

    Thanks for the nice description!
    PDF preprint just uploaded on ResearchGate 😉

    1. Felienne (Post author)

      Great, updated my post 🙂
