We’ll continue the plan to rate movies with Markov chains.
This time we’ll tokenize the input.
It’s been a while, so it took a few steps to get this up to date and running again.
There have even been a few new movies in the meantime. Sadly, the data format changed, so using the new format would require some work. Luckily the older data is still available in the old format, so we’ll go with that for now :)
The next step is to tokenize the plots. This means a few things:
First, we deal with signs: some are interesting and become tokens, others we drop.
case class SignToken(sign: String) extends Token

val tokenSignsRegex: Regex = """([(),;!?.])""".r // These become tokens
val ignoredSignsRegex: Regex = """[:"&]""".r // We drop these
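As a sketch of how the two regexes might be applied (the `splitSigns` helper here is hypothetical, not the post's actual code): pad the interesting signs with spaces so they survive as separate pieces, delete the ignored ones, then split on whitespace.

```scala
import scala.util.matching.Regex

val tokenSignsRegex: Regex = """([(),;!?.])""".r
val ignoredSignsRegex: Regex = """[:"&]""".r

// Hypothetical helper: surround interesting signs with spaces,
// delete ignored signs, then split on whitespace.
def splitSigns(text: String): List[String] =
  ignoredSignsRegex
    .replaceAllIn(tokenSignsRegex.replaceAllIn(text, m => s" ${m.group(1)} "), "")
    .split("""\s+""")
    .filter(_.nonEmpty)
    .toList
```

Using a function for the replacement (rather than a replacement string) avoids any special handling of `$` in the matched text.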
We separate all the words and compare them against a dictionary of usable words; the usable words become tokens. For the others, we forget the actual word and turn each into an InfrequentWord:

case class WordToken(word: String) extends Token
object InfrequentWord extends Token
We could use an external dictionary for this, but I think it’s convenient to compile it from the complete list of plots.
This has the advantage that we ignore all infrequent words in our scenario.
We do this to reduce overfitting: our model cannot learn much from the few instances where these words occur, so it shouldn’t base its output on them. It should also speed things up.
For now I choose a minimum of 5 occurrences; this threshold can be tweaked later on.
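A minimal sketch of compiling such a dictionary from the plots themselves (`buildDictionary` and its word-splitting rule are my assumptions, not the post's code):

```scala
// Count how often each word appears across all plots and keep only
// those at or above the threshold. Lowercasing and splitting on
// non-letters is an assumed normalization, not the post's.
def buildDictionary(plots: List[String], minOccurrences: Int = 5): Set[String] =
  plots
    .flatMap(_.toLowerCase.split("""[^a-z]+""").filter(_.nonEmpty))
    .groupBy(identity)
    .collect { case (word, occurrences) if occurrences.size >= minOccurrences => word }
    .toSet
```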
Finally, we start all our plots with a StartToken and end them with an EndToken:

object StartToken extends Token
object EndToken extends Token
This means a plot like "Hello world!" will become a List of Tokens:
List(StartToken, WordToken("hello"), WordToken("world"), SignToken("!"), EndToken)
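Putting the pieces together, a tokenizer along these lines might look like the following. This is a sketch under the assumptions above; the helper structure and dictionary handling are mine, not the post's actual implementation.

```scala
import scala.util.matching.Regex

sealed trait Token
case class WordToken(word: String) extends Token
case class SignToken(sign: String) extends Token
object InfrequentWord extends Token
object StartToken extends Token
object EndToken extends Token

val tokenSignsRegex: Regex = """([(),;!?.])""".r
val ignoredSignsRegex: Regex = """[:"&]""".r

def tokenize(plot: String, dictionary: Set[String]): List[Token] = {
  // Pad interesting signs with spaces, drop ignored signs, split on whitespace.
  val pieces = ignoredSignsRegex
    .replaceAllIn(tokenSignsRegex.replaceAllIn(plot, m => s" ${m.group(1)} "), "")
    .split("""\s+""")
    .filter(_.nonEmpty)
    .toList
  // Signs become SignTokens; known words become WordTokens;
  // everything else collapses to InfrequentWord.
  val body = pieces.map {
    case s if tokenSignsRegex.matches(s) => SignToken(s)
    case w if dictionary(w.toLowerCase)  => WordToken(w.toLowerCase)
    case _                               => InfrequentWord
  }
  StartToken :: body ::: List(EndToken)
}
```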
Perfect for learning a Markov chain!