Markov movie critic - part 4 - classifier
We’ll continue the plan to rate movies with Markov chains.
This time we predict ratings.
Classifier
We will learn five Markov chains, one for each rating:
object MovieClassifier {
  def learn(moviesLearnSet: List[Movie]): MovieClassifier = {
    val modelsByRating: Map[Rating, MarkovChain] = moviesLearnSet
      .groupBy(_.rating)
      .map { case (rating, ms) =>
        val model = MarkovChain.learn(ms.map(_.plot))
        (rating, model)
      }
    MovieClassifier(modelsByRating)
  }
}
To predict the rating of a new plot, we calculate the probability that each Markov chain outputs that plot, and select the rating whose chain gives the highest probability:
case class MovieClassifier(chains: Map[Rating, MarkovChain]) {
  def predict(plot: Plot): Rating =
    chains.maxBy { case (_, chain) =>
      chain.probabilityToOutput(plot)
    }._1
}
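One thing to watch with this kind of scoring: a plot's probability is a product of many numbers below 1, so for long plots a plain Double product can underflow to 0.0 and make every chain tie. Whether `probabilityToOutput` already guards against this is not shown in this part. The sketch below is a toy stand-in, not the series' `MarkovChain`: `ToyChain`, `ToyClassifier`, `logProbabilityToOutput`, the String tokens, the Int ratings and the add-one smoothing are all assumptions made for illustration. It sums log-probabilities instead of multiplying probabilities, which keeps the same ordering between chains.

```scala
// Hypothetical toy bigram chain; tokens are plain Strings, ratings plain Ints.
final case class ToyChain(
    counts: Map[(String, String), Int], // how often `to` followed `from`
    totals: Map[String, Int],           // how often `from` was followed by anything
    vocabSize: Int
) {
  // Sum of log transition probabilities, with add-one smoothing so that
  // unseen transitions get a small non-zero probability instead of log(0).
  def logProbabilityToOutput(tokens: List[String]): Double =
    tokens.zip(tokens.drop(1)).map { case (from, to) =>
      val seen = counts.getOrElse((from, to), 0)
      val total = totals.getOrElse(from, 0)
      math.log((seen + 1.0) / (total + vocabSize))
    }.sum
}

object ToyChain {
  def learn(plots: List[List[String]]): ToyChain = {
    // All adjacent token pairs across all plots.
    val transitions = plots.flatMap(p => p.zip(p.drop(1)))
    val counts = transitions.groupBy(identity).map { case (k, v) => (k, v.size) }
    val totals = transitions.groupBy(_._1).map { case (k, v) => (k, v.size) }
    // At least 1, so the smoothed denominator is never zero.
    val vocabSize = math.max(1, plots.flatten.distinct.size)
    ToyChain(counts, totals, vocabSize)
  }
}

object ToyClassifier {
  // Same shape as predict above, but comparing log-probabilities.
  def predict(chains: Map[Int, ToyChain], plot: List[String]): Int =
    chains.maxBy { case (_, chain) => chain.logProbabilityToOutput(plot) }._1
}
```

Because the logarithm is monotonic, the chain with the highest log-probability is also the chain with the highest probability, so the selected rating is unchanged wherever the plain product does not underflow.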
We read all our examples and split them into a learn set and a test set:
import scala.util.Random

case class MovieSet(learnSet: List[Movie], testSet: List[Movie], dictionary: Dictionary)

object MovieSet {
  def readMovies(fraction: Int): MovieSet = {
    val totalTestSet = 283664
    val inputMovies = readRaw(totalTestSet / fraction)
    // Words occurring fewer than 5 times are left out of the dictionary.
    val dictionary = Dictionary.build(inputMovies.map(_.plot), 5)
    val allMovies: List[Movie] = Random.shuffle(inputMovies.map { m =>
      Movie(m.title, tokenize(m.plot, dictionary), m.rating)
    })
    // First 90% for learning, remaining 10% for testing (no overlap).
    val learnSize = (allMovies.size * 0.9).toInt
    val learnSet = allMovies.take(learnSize)
    val testSet = allMovies.drop(learnSize)
    MovieSet(learnSet, testSet, dictionary)
  }
}
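Since `take` and `drop` have to agree on the cut point for the two sets to be disjoint, one way to make the split harder to get wrong is a single `splitAt`. The following is a minimal sketch, not the code used for the runs in this post: `SplitSketch`, `learnFraction` and the fixed `seed` are hypothetical names introduced here.

```scala
import scala.util.Random

object SplitSketch {
  // Shuffle with a fixed seed (for repeatable runs), then cut once,
  // so the learn part and the test part cannot overlap.
  def split[A](items: List[A], learnFraction: Double, seed: Long): (List[A], List[A]) = {
    val shuffled = new Random(seed).shuffle(items)
    shuffled.splitAt((shuffled.size * learnFraction).toInt)
  }
}
```

With this helper the two sets could be obtained as `val (learnSet, testSet) = SplitSketch.split(allMovies, 0.9, seed = 42L)`.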
Finally, we add an accuracy measure to the classifier and tie it all together:
// Same case class as above; only the added method is shown here.
case class MovieClassifier(chains: Map[Rating, MarkovChain]) {
  // Fraction of movies whose predicted rating equals the actual rating.
  def testAccuracy(movies: List[Movie]): Double = {
    val predictions = movies.map { movie =>
      (movie, predict(movie.plot))
    }
    val correctPredictions = predictions.count { case (m, p) => m.rating == p }
    correctPredictions.toDouble / predictions.length
  }
}

object MarkovMovieCritic extends App {
  val movies = MovieSet.readMovies(1)
  println(s"read movieSet $movies")
  println("learning models")
  val classifier = MovieClassifier.learn(movies.learnSet)
  println(s"accuracy on learnSet: ${classifier.testAccuracy(movies.learnSet)}")
  println(s"accuracy on testSet: ${classifier.testAccuracy(movies.testSet)}")
}
Test run
read movieSet MovieSet with learnSet:477 * 1 star,7773 * 2 star,47229 * 3 star,148726 * 4 star,51092 * 5 star, testSet:64 * 1 star,824 * 2 star,5136 * 3 star,16630 * 4 star,5713 * 5 star, Dictionary of 87882 words
learning models
accuracy on learnSet: 0.8665123366118678
accuracy on testSet: 0.2047802023477985
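Horrible. That is no better than uniform random guessing (20%), and far worse than just labelling everything 4-star, since 4-star movies make up more than half of the test set.
Because we get about 87% on the learn set but only about 20% on the test set, we’re overfitting.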
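To put the test accuracy in context, it helps to work out what always answering the most frequent rating would score. The sketch below does that arithmetic with the test-set counts copied from the run output above; `BaselineSketch` and `majorityBaseline` are hypothetical names, and ratings are plain Ints here rather than the series' `Rating` type.

```scala
object BaselineSketch {
  // Accuracy obtained by always predicting the most frequent rating.
  def majorityBaseline(countsByRating: Map[Int, Int]): Double =
    countsByRating.values.max.toDouble / countsByRating.values.sum

  // Test-set counts per star rating, copied from the first run's output.
  val testCounts: Map[Int, Int] =
    Map(1 -> 64, 2 -> 824, 3 -> 5136, 4 -> 16630, 5 -> 5713)
}
```

`BaselineSketch.majorityBaseline(BaselineSketch.testCounts)` comes to 16630 / 28367, roughly 0.59, which is the number a useful classifier on this data would need to beat.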
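If we raise the minimum number of word occurrences for the dictionary from 5 to 5000, the test accuracy gets somewhat better and the gap to the learn set shrinks, although it is still well below what labelling everything 4-star would give: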
read movieSet MovieSet with learnSet:488 * 1 star,7717 * 2 star,47129 * 3 star,148888 * 4 star,51075 * 5 star, testSet:53 * 1 star,880 * 2 star,5236 * 3 star,16468 * 4 star,5730 * 5 star, Dictionary of 557 words
learning models
accuracy on learnSet: 0.6049307277406315
accuracy on testSet: 0.3501956498748546