This is the first in a series of posts in which we will use machine learning to rate movies. For this task we're not going to watch all the movies; I assume it's good enough to just read the plot. We'll use Markov chains to rate the movies, and as an added bonus we can also generate new plots for awesome (or terrible) movies. In this first part we'll get the data and convert it into a more usable format. We can use the data from IMDb, which is published at ftp://ftp.fu-berlin.de/pub/misc/movies/database/. Of interest are the plots and the ratings.

Plots look like this:

-------------------------------------------------------------------------------
MV: Fear and Loathing in Las Vegas (1998)

PL: The big-screen version of Hunter S. Thompson's seminal psychedelic classic
PL: about his road trip across Western America as he and his large Samoan
PL: lawyer searched desperately for the "American dream"... they were helped in
PL: large part by the huge amount of drugs and alcohol kept in their
PL: convertible, The Red Shark.

BY: Laurence Mixson

PL: Raoul duke is a drug addled gonzo journalist. he is sent to cover a
PL: motorcycle race as an article for his magazine, but then the situation
PL: escalates into him and his psychotic attorney searching for the American
PL: dream, aided by almost every drug known to man in the boot of his red
PL: convertible.

BY: palmtreehead

PL: An adaptation of Hunter S. Thompson's novel of the same name. The film
PL: details a whacky search for the "American Dream", by Thompson and his
PL: crazed, Samoan lawyer. Fueled by the massive amount of drugs they purchased
PL: with an advance from a magazine to cover a sporting event in Vegas; they
PL: set out in the Red Shark. Encountering police, reporters, gamblers, racers,
PL: and hitchhikers; they search for some undefinable thing know only as the
PL: "American Dream" and find fear, loathing and hilarious adventures into the
PL: dementia of the modern American West.

BY: J. D. Keith

-------------------------------------------------------------------------------
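To make the record layout concrete, here is a minimal sketch of how such a file could be read. It assumes only what the sample shows: `MV:` starts a movie, `PL:` lines carry a summary, and `BY:` ends one. The function name `parse_plots` is made up for this illustration and is not necessarily how the code in the repository does it.

```python
from typing import Dict, Iterable, List, Optional


def parse_plots(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each movie title to its plot summaries, in file order."""
    plots: Dict[str, List[str]] = {}
    title: Optional[str] = None
    current: List[str] = []  # PL: lines of the summary being read

    def flush() -> None:
        # Store the summary collected so far (if any) under the current title.
        if title is not None and current:
            plots.setdefault(title, []).append(" ".join(current))
        current.clear()

    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("MV: "):
            flush()  # a summary without a BY: line still gets stored
            title = line[4:].strip()
        elif line.startswith("PL: "):
            current.append(line[4:].strip())
        elif line.startswith("BY: "):
            flush()  # the author line closes one summary
    flush()
    return plots
```

Called as `parse_plots(open("plot.list", encoding="latin-1"))`, this would return a dictionary of titles to summaries. The file name and the Latin-1 encoding are assumptions here, so check them against the files you actually download.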

And a line from ratings.list looks like this:

0000001212  208460   7.7  Fear and Loathing in Las Vegas (1998)
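A line like this can be split on whitespace. The sketch below assumes the four columns are a vote-distribution string, the number of votes, the rating and the title; that reading comes from the sample line only, and `parse_rating_line` is a hypothetical helper name.

```python
from typing import Optional, Tuple


def parse_rating_line(line: str) -> Optional[Tuple[str, float]]:
    """Return (title, rating) for a data line, or None for anything else."""
    # Split into at most four fields so spaces inside the title survive.
    parts = line.split(None, 3)
    if len(parts) != 4:
        return None
    _distribution, votes, rating, title = parts
    try:
        int(votes)  # header and comment lines fail here and are skipped
        return title.strip(), float(rating)
    except ValueError:
        return None
```

Collecting the results into a dictionary, for example `dict(filter(None, map(parse_rating_line, lines)))`, gives a quick title-to-rating lookup for the next step.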

We'll combine these and output one file with just a title, one plot, and a rating in stars (1 to 5). The result looks like this:

Fear and Loathing in Las Vegas (1998) The big-screen version of Hunter S. Thompson's seminal psychedelic classic about his road trip across Western America as he and his large Samoan lawyer searched desperately for the "American dream"... they were helped in large part by the huge amount of drugs and alcohol kept in their convertible, The Red Shark. 4
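Here is a rough sketch of the combining step. The post doesn't spell out how a rating becomes stars, so this assumes the 10-point rating is halved and rounded, which turns the 7.7 above into 4 stars. Keeping the first plot and separating the fields with tabs are also assumptions, and `to_stars` and `combine` are names invented for this example.

```python
from typing import Dict, Iterator, List


def to_stars(rating: float) -> int:
    """Assumed conversion: halve the 10-point rating, round, clamp to 1..5."""
    return max(1, min(5, round(rating / 2)))


def combine(plots: Dict[str, List[str]],
            ratings: Dict[str, float]) -> Iterator[str]:
    """Yield one 'title<TAB>plot<TAB>stars' line per movie found in both inputs."""
    for title, summaries in plots.items():
        if title in ratings and summaries:
            # Keep only the first summary, as in the sample output.
            yield "\t".join([title, summaries[0], str(to_stars(ratings[title]))])
```

Movies that have a plot but no rating (or the other way around) are simply skipped, since without both there is nothing to learn from.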

Next time: Markov chains.

You can find all the code on GitHub.
