We’ll continue the plan to rate movies with Markov chains.
This time we’ll tokenize the input.

It’s been a while, so a few steps were needed to get this project up to date and running again.

There have even been a few new movies in the meantime. Sadly, the data format has changed, so using the new data would require some extra work. Luckily the older data is still available in the old format, so we’ll go with that for now :)

Tokenization

The next step is to tokenize the plots. This means a few things:

We deal with signs (punctuation): some are interesting and become tokens, others we drop.

  import scala.util.matching.Regex

  case class SignToken(sign: String) extends Token

  val tokenSignsRegex: Regex = """([(),;!?.])""".r // These become tokens
  val ignoredSignsRegex: Regex = """[:"&]""".r // We drop these
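
To make this concrete, here is a minimal sketch (not the actual project code) of how these two regexes could be applied before splitting into words; the helper name separateSigns is made up for illustration:

  // Hypothetical helper: drop the ignored signs and pad the interesting signs
  // with spaces, so a later split on whitespace yields them as standalone pieces.
  def separateSigns(text: String): String = {
    val withoutIgnored = ignoredSignsRegex.replaceAllIn(text, "")
    tokenSignsRegex.replaceAllIn(withoutIgnored, " $1 ")
  }

  // separateSigns("Hello, world!") yields something like "Hello ,  world ! "
  // (the extra whitespace is harmless once we split on it)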

We separate all the words and compare them against a dictionary of usable words. All the usable words become tokens; for the others, we forget the actual word and turn it into an InfrequentWord token.

  case class WordToken(word: String) extends Token
  object InfrequentWord extends Token

We could use an external dictionary for this, but I think it’s more convenient to compile one from the complete list of plots. This has the advantage that all words that are infrequent in our scenario are ignored.
We do this to reduce overfitting: the model cannot learn much from the few instances where these words occur, so it shouldn’t base its output on them. It should also speed things up.
For now I choose a minimum of 5 occurrences; this should be tweaked later on. A sketch of how the dictionary could be compiled follows below.
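
The names buildDictionary, minOccurrences and plots below are illustrative, not the project’s actual code; the sketch reuses the separateSigns helper and tokenSignsRegex from above:

  // Hypothetical sketch: count how often each word occurs across all plots and
  // keep only the words that appear at least minOccurrences times.
  val minOccurrences = 5

  def buildDictionary(plots: Seq[String]): Set[String] =
    plots
      .flatMap(plot => separateSigns(plot.toLowerCase).split("""\s+"""))
      .filter(piece => piece.nonEmpty && !tokenSignsRegex.pattern.matcher(piece).matches())
      .groupBy(identity)
      .collect { case (word, occurrences) if occurrences.size >= minOccurrences => word }
      .toSet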

Finally, we start every plot with a StartToken and end it with an EndToken:

  object StartToken extends Token
  object EndToken extends Token

This means a plot like "Hello world!" becomes a List of Tokens:

  List(StartToken, WordToken("hello"), WordToken("world"), SignToken("!"), EndToken)
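
Putting the pieces together, a tokenizer along these lines could look roughly like this. Again, this is only a sketch built on the illustrative helpers above, not the project’s actual implementation:

  // Sketch of the whole pipeline: separate signs, split into pieces, map each
  // piece to a Sign-, Word- or InfrequentWord token, then wrap the result in
  // StartToken / EndToken.
  def tokenize(plot: String, dictionary: Set[String]): List[Token] = {
    val pieces = separateSigns(plot.toLowerCase)
      .split("""\s+""")
      .filter(_.nonEmpty)
      .toList
    val tokens = pieces.map {
      case sign if tokenSignsRegex.pattern.matcher(sign).matches() => SignToken(sign)
      case word if dictionary.contains(word)                       => WordToken(word)
      case _                                                       => InfrequentWord
    }
    StartToken :: tokens ::: List(EndToken)
  }

  // tokenize("Hello world!", Set("hello", "world"))
  // => List(StartToken, WordToken("hello"), WordToken("world"), SignToken("!"), EndToken)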

Perfect for learning a Markov chain!
