We’ll continue our plan to rate movies with Markov chains. This time we use the chains to predict ratings.

Classifier

We will learn five different Markov chains, one for each rating (1 to 5 stars):

object MovieClassifier {
  def learn(moviesLearnSet: List[Movie]): MovieClassifier = {
    val modelsByRating: Map[Rating, MarkovChain] = moviesLearnSet
      .groupBy(_.rating)
      .map { case (rating, ms) =>
        val model = MarkovChain.learn(ms.map(_.plot))
        (rating, model)
      }
    MovieClassifier(modelsByRating)
  }
}

To make a prediction for a new plot, we calculate, for each Markov chain, the probability that it would output that plot, and select the rating whose chain gives the highest probability:

case class MovieClassifier(chains: Map[Rating, MarkovChain]) {
  def predict(plot: Plot): Rating =
    chains.maxBy { case (_, chain) =>
      chain.probabilityToOutput(plot)
    }._1
}
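
Both MarkovChain.learn and probabilityToOutput come from the previous post. For readers joining here, this is a minimal sketch of what such a chain might look like, under the assumptions that a Plot is simply a list of word tokens and that the chain is first-order; the real implementation may differ.

// Hypothetical sketch -- the real MarkovChain is defined in the previous post.
// Assumptions: a Plot is a list of word tokens, the chain is first-order, and
// unseen transitions get a small fallback probability instead of zero.
case class MarkovChain(transitions: Map[(String, String), Double], fallback: Double) {
  // Score a plot by summing log transition probabilities. Summing logs instead
  // of multiplying raw probabilities avoids numeric underflow on long plots,
  // and because log is monotonic, maxBy still selects the same rating.
  def probabilityToOutput(plot: List[String]): Double =
    plot.zip(plot.drop(1)).map { case (from, to) =>
      math.log(transitions.getOrElse((from, to), fallback))
    }.sum
}

object MarkovChain {
  def learn(plots: List[List[String]]): MarkovChain = {
    // Count how often each word is followed by each next word ...
    val bigrams = plots.flatMap(plot => plot.zip(plot.drop(1)))
    // ... and normalize the counts per first word into probabilities.
    val transitions = bigrams.groupBy { case (from, _) => from }.flatMap {
      case (_, pairs) =>
        val total = pairs.size.toDouble
        pairs.groupBy(identity).map { case (pair, occurrences) =>
          (pair, occurrences.size / total)
        }
    }
    MarkovChain(transitions, fallback = 1e-6)
  }
}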

We read all our examples and split them into a learn set and a test set:

import scala.util.Random

case class MovieSet(learnSet: List[Movie], testSet: List[Movie], dictionary: Dictionary)

object MovieSet {
  def readMovies(fraction: Int): MovieSet = {
    // 283664 is the total number of movies in the raw data set;
    // fraction lets us read only a part of it for quick experiments.
    val totalMovies = 283664
    val inputMovies = readRaw(totalMovies / fraction)
    val dictionary = Dictionary.build(inputMovies.map(_.plot), 5)
    // Tokenize every plot and shuffle, so the learn/test split is random.
    val allMovies: List[Movie] = Random.shuffle(inputMovies.map { m =>
      Movie(m.title, tokenize(m.plot, dictionary), m.rating)
    })
    // First 90% for learning, the remaining 10% for testing.
    val learnSet = allMovies.take((allMovies.size * 0.9).toInt)
    val testSet = allMovies.drop((allMovies.size * 0.9).toInt)
    MovieSet(learnSet, testSet, dictionary)
  }
}
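
Dictionary.build and tokenize are also taken from the previous post. As a rough sketch of what they might do, assuming the second argument of build is the minimum number of occurrences a word needs to stay in the vocabulary (the source confirms the parameter, not the exact implementation):

// Hypothetical sketch -- the real Dictionary and tokenize come from the previous post.
case class Dictionary(words: Set[String])

object Dictionary {
  def build(plots: List[String], minOccurrences: Int): Dictionary = {
    // Count every word across all plots and keep only the frequent ones.
    val counts = plots
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
    Dictionary(counts.collect { case (word, count) if count >= minOccurrences => word }.toSet)
  }

  // Words outside the dictionary are mapped to a shared unknown token, so all
  // chains work with the same limited vocabulary. (Import Dictionary.tokenize
  // to call it unqualified, as in readMovies above.)
  def tokenize(plot: String, dictionary: Dictionary): List[String] =
    plot.toLowerCase.split("\\W+").toList
      .filter(_.nonEmpty)
      .map(word => if (dictionary.words.contains(word)) word else "<unknown>")
}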

Finally, we tie it all together:

case class MovieClassifier(chains: Map[Rating, MarkovChain]) {
  def testAccuracy(movies: List[Movie]): Double = {
    val predictions = movies.map { movie =>
      (movie, predict(movie.plot))
    }
    val correctPredictions = predictions.count { case (m, p) => m.rating == p }
    correctPredictions.toDouble / predictions.length
  }
}

object MarkovMovieCritic extends App {
  val movies = MovieSet.readMovies(1)
  println(s"read movieSet $movies")

  println(s"learning models")
  val classifier = MovieClassifier.learn(movies.learnSet)

  println(s"accuracy on learnSet: ${classifier.testAccuracy(movies.learnSet)}")
  println(s"accuracy on testSet: ${classifier.testAccuracy(movies.testSet)}")
}

Test run

read movieSet MovieSet with learnSet:477 * 1 star,7773 * 2 star,47229 * 3 star,148726 * 4 star,51092 * 5 star, testSet:64 * 1 star,824 * 2 star,5136 * 3 star,16630 * 4 star,5713 * 5 star, Dictionary of 87882 words
learning models
accuracy on learnSet: 0.8665123366118678
accuracy on testSet: 0.2047802023477985

Horrible. That is barely better than random guessing (20%), and far below the trivial baseline of labelling everything 4-star, which would already score about 59% on this test set.
Because we get about 87% on the learn set but only 20% on the test set, we’re clearly overfitting.
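
A quick sanity check of those baselines, using the test-set class counts from the run above:

// Test-set class counts from the run above, for 1 to 5 stars.
val testCounts = List(64, 824, 5136, 16630, 5713)
val total = testCounts.sum                              // 28367 plots
val randomGuess = 1.0 / testCounts.size                 // 0.2
val alwaysFourStars = testCounts.max.toDouble / total   // ~0.586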

If we raise the minimum number of word occurrences required for the dictionary from 5 to 5000, the result gets slightly better: the dictionary shrinks from 87882 to 557 words, so each chain has far fewer transition probabilities to estimate from the same amount of data.
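The only change is the threshold passed to Dictionary.build in readMovies:

val dictionary = Dictionary.build(inputMovies.map(_.plot), 5000)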

read movieSet MovieSet with learnSet:488 * 1 star,7717 * 2 star,47129 * 3 star,148888 * 4 star,51075 * 5 star, testSet:53 * 1 star,880 * 2 star,5236 * 3 star,16468 * 4 star,5730 * 5 star, Dictionary of 557 words
learning models
accuracy on learnSet: 0.6049307277406315
accuracy on testSet: 0.3501956498748546