Building a Movie Recommender - Part 1
An Introduction to Collaborative Filtering
Collaborative filtering is a technique used by recommendation engines to recommend items to users, e.g. products on Amazon and movies/series on Netflix.
Recommendations from Netflix based on my viewing history
The basic idea behind collaborative filtering is the following:
- Consider the items that the current user has liked
- Find other users that have liked similar items
- Recommend items those users have liked
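To make this idea concrete, here is a toy sketch in plain Python (the users and their liked movies are made up purely for illustration). It ranks the other users by how many liked movies they share with the current user, then recommends whatever the most similar user has liked that the current user hasn't seen:

```python
# Toy data: made-up users and the movies they liked
liked = {
    'alice': {'Toy Story', 'Aliens', 'Heat'},
    'bob':   {'Toy Story', 'Aliens', 'Casino'},
    'carol': {'Titanic', 'Ghost'},
}

def recommend(user):
    # Rank the other users by how many liked movies they share with `user`
    others = sorted(
        (u for u in liked if u != user),
        key=lambda u: len(liked[u] & liked[user]),
        reverse=True,
    )
    # Recommend what the most similar user liked but `user` hasn't seen
    return liked[others[0]] - liked[user]

print(recommend('alice'))  # {'Casino'}
```

Real recommendation engines deal with millions of users and items, so they rely on learned models rather than exhaustive comparisons, but the underlying intuition is the same.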
In this 3-part series of posts, we will create our own movie recommendation system:
- Part 1 (this post) - we will train a model capable of recommending movies to users
- Part 2 - we will build a system to recommend movies to new users
- Part 3 - we will create an API and UI for our system and deploy them
The MovieLens Dataset
The MovieLens dataset contains millions of ratings of movies. For simplicity, we'll use a subset of this dataset that contains 100,000 ratings to build a model.
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user','movie','rating','timestamp'])
ratings.head()
| | user | movie | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
Each row in this dataset is the rating a user has given to a particular movie. We need to build a model that will predict the rating a user will give to a movie they haven't watched.
Another way to view this data is by cross-tabulating it - making a table where the rows represent users and columns represent movies.
pd.crosstab(ratings.user, ratings.movie, values=ratings.rating, aggfunc='sum')
user \ movie | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 1673 | 1674 | 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 5.0 | 3.0 | 4.0 | 3.0 | 3.0 | 5.0 | 4.0 | 1.0 | 5.0 | 3.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 4.0 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
939 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
940 | NaN | NaN | NaN | 2.0 | NaN | NaN | 4.0 | 5.0 | 3.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
941 | 5.0 | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
942 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
943 | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
943 rows × 1682 columns
In this view of the data, we need to predict all the missing values (shown as `NaN`).
If we had properties describing each movie, for example its genre, and we also knew how much each user liked each genre, our job would be very easy. We could simply recommend movies that the user hasn't watched but would probably like based on their genre preferences.
However, we don't have this kind of information about the movies or users. The only thing we know is how much each user liked the movies they rated.
What we can do is create vectors (with randomly initialized values) for each user and each movie. Then, instead of trying to determine what each component of these vectors means, we can use the data available to us to learn what their values should be using gradient descent.
Such vectors are also commonly known as latent factors or embeddings.
Learning representations of movies and users
In general, gradient descent works as follows:
Step 1: Initialize parameters
In this case, the vectors for our users and movies are the parameters that we initialize with random values.
Step 2: Calculate predictions
We need to predict the rating a user gives to a movie they have watched. A straightforward way to calculate these predictions from the user and movie vectors is using the dot product.
This requires the vectors that represent users and movies to be of the same size. If we use a neural network instead of a dot product, the user and movie vectors can have different lengths; we will look at that approach later in this post.
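As a tiny illustration of the dot-product approach (the 5-dimensional vectors below are made up), the prediction is just the element-wise product of the two vectors summed into a single number:

```python
import torch

# Made-up 5-dimensional vectors for one user and one movie
user_vec  = torch.tensor([0.2, -1.3, 0.8, 0.1, 0.5])
movie_vec = torch.tensor([1.1,  0.4, 0.9, -0.2, 0.3])

# The predicted rating is the dot product of the two vectors
predicted_rating = (user_vec * movie_vec).sum()  # tensor(0.5500)
```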
Step 3: Calculate the loss
The value of the loss tells us how far our model's predictions are from the actual ratings provided by the users. A high loss value is bad, and the objective of the gradient descent process is to iteratively minimize the loss.
Averaging the absolute differences between the actual ratings and the model's predictions gives the mean absolute error. Another way to calculate the loss is the mean squared error, which averages the squared differences instead.
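For example, given three made-up ratings and predictions, the two losses are computed like this:

```python
import torch

# Made-up actual ratings and model predictions for three user/movie pairs
actual = torch.tensor([4.0, 2.0, 5.0])
preds  = torch.tensor([3.5, 2.5, 3.0])

mae = (actual - preds).abs().mean()   # mean absolute error -> 1.0
mse = ((actual - preds) ** 2).mean()  # mean squared error  -> 1.5
```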
Step 4: Optimize parameters
The final step in the process is to use the gradients of the loss to update the user and movie vectors in the direction that reduces the loss. Steps 2 to 4 are then repeated until the loss stops improving.
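Putting the four steps together, here is a minimal sketch of the whole process in plain PyTorch. The (user, movie, rating) triples below are made up, and this only shows the mechanics of the idea, not fastai's actual implementation:

```python
import torch

n_users, n_movies, n_factors = 943, 1682, 50

# Step 1: randomly initialized vectors (latent factors) for every user and movie
user_factors  = torch.randn(n_users, n_factors, requires_grad=True)
movie_factors = torch.randn(n_movies, n_factors, requires_grad=True)

# A few made-up (user, movie, rating) triples standing in for the real data
users   = torch.tensor([0, 0, 1])
movies  = torch.tensor([10, 20, 10])
ratings = torch.tensor([4.0, 3.0, 5.0])

optimizer = torch.optim.SGD([user_factors, movie_factors], lr=0.01)

for epoch in range(100):
    # Step 2: predictions are the dot products of the matching user and movie vectors
    preds = (user_factors[users] * movie_factors[movies]).sum(dim=1)
    # Step 3: mean squared error between the predictions and the actual ratings
    loss = ((preds - ratings) ** 2).mean()
    # Step 4: update the vectors in the direction that reduces the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```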
Building a model
Building a collaborative filtering model with the fastai library is extremely easy using the `fastai.collab` module.
But first, let's add the movie titles to our data, since they are easier to understand than the IDs.
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
usecols=(0,1), names=('movie','title'), header=None)
ratings = ratings.merge(movies)
ratings.head()
| | user | movie | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
| 1 | 63 | 242 | 3 | 875747190 | Kolya (1996) |
| 2 | 226 | 242 | 5 | 883888671 | Kolya (1996) |
| 3 | 154 | 242 | 3 | 879138235 | Kolya (1996) |
| 4 | 306 | 242 | 5 | 876503793 | Kolya (1996) |
Then, we need to create DataLoaders for our data.
from fastai.collab import *
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
| | user | title | rating |
|---|---|---|---|
| 0 | 113 | Twelve Monkeys (1995) | 3 |
| 1 | 181 | Chasing Amy (1997) | 1 |
| 2 | 90 | Eat Drink Man Woman (1994) | 3 |
| 3 | 506 | Tombstone (1993) | 4 |
| 4 | 189 | Citizen Kane (1941) | 5 |
| 5 | 749 | Black Beauty (1994) | 3 |
| 6 | 246 | Jack (1996) | 2 |
| 7 | 119 | Family Thing, A (1996) | 4 |
| 8 | 707 | Rebel Without a Cause (1955) | 2 |
| 9 | 128 | In the Line of Duty 2 (1987) | 5 |
We now create a model using `collab_learner`.
learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
In addition to the dataloaders, we specified the following information in this line:
- A 50-dimensional vector should be used to represent each movie and user.
- The output of the model should be between 0 and 5.5. Although the ratings in our dataset go from 1 to 5, using a slightly higher upper bound seems to work a little better in practice, according to Fastbook.
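Under the hood, fastai keeps the output inside `y_range` by passing the model's raw prediction through a scaled sigmoid (fastbook calls this `sigmoid_range`). A minimal sketch of the idea:

```python
import torch

def sigmoid_range(x, low, high):
    # Squash unbounded model outputs into the (low, high) interval
    return torch.sigmoid(x) * (high - low) + low

raw = torch.tensor([-3.0, 0.0, 4.0])  # unbounded raw outputs
print(sigmoid_range(raw, 0, 5.5))     # every value now lies between 0 and 5.5
```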
We now use the `fit_one_cycle` method to train the model.
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.927610 | 0.938143 | 00:07 |
1 | 0.859944 | 0.874978 | 00:07 |
2 | 0.729279 | 0.834507 | 00:07 |
3 | 0.604334 | 0.820646 | 00:07 |
4 | 0.485712 | 0.821522 | 00:07 |
We can now `export` this model so that we can use it to make predictions.
learn.export('movie-recommender.pkl')
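As a quick sanity check, the exported learner can be loaded back and used to score a few user/title pairs. This sketch assumes the `.pkl` file is in the current working directory and uses fastai's usual test-DataLoader inference pattern:

```python
from fastai.collab import *
import pandas as pd

# Load the exported learner back from disk
learn = load_learner('movie-recommender.pkl')

# Score a couple of user/title pairs (same column names as the training data)
pairs = pd.DataFrame({'user': [196, 196],
                      'title': ['Kolya (1996)', 'Twelve Monkeys (1995)']})
dl = learn.dls.test_dl(pairs)
preds, _ = learn.get_preds(dl=dl)
print(preds)  # predicted ratings, each between 0 and 5.5
```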
Chapter 8 of Fastbook is an excellent resource to understand how this model works under the hood and how to build such a model from scratch.
Using deep learning for collaborative filtering
The approach that we've considered so far (using the dot product between movie and user vectors to predict a user's rating) is also known as probabilistic matrix factorization (PMF).
Another approach would be to use a neural network. As mentioned previously, when using neural networks the size of the user and movie vectors can be different.
Fastai provides the `get_emb_sz` function, which uses some heuristics to suggest suitable vector lengths.
get_emb_sz(dls)
[(944, 74), (1665, 102)]
The recommendation for our dataset is to use:
- A vector of length 74 for each user
- A vector of length 102 for each movie
These vectors don't need to be the same length because, instead of taking a dot product, the neural network concatenates them and passes the result through its layers.
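For example, with random values standing in for learned embeddings of the suggested sizes:

```python
import torch

# Random values standing in for a learned 74-dim user vector and a 102-dim movie vector
user_vec  = torch.randn(74)
movie_vec = torch.randn(102)

# The neural network's input is simply the two vectors concatenated
x = torch.cat([user_vec, movie_vec])
print(x.shape)  # torch.Size([176])
```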
Using neural networks for collaborative filtering in Fastai is super easy! We still use `collab_learner` like before, but we set `use_nn=True`. We can also specify the sizes of the `layers` in the network.
learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100,50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.962094 | 0.988195 | 00:11 |
1 | 0.931443 | 0.912817 | 00:11 |
2 | 0.850018 | 0.879676 | 00:11 |
3 | 0.815401 | 0.851548 | 00:11 |
4 | 0.769558 | 0.856113 | 00:10 |
This approach allows us to include additional information that may be relevant to the predictions (e.g. the date and time of the rating). All we need to do is concatenate that information to the input vector and train the neural network!
Next steps
In this post, we covered the following:
- How collaborative filtering works
- How we can use gradient descent to learn user and movie vectors
- How to use fastai to train a movie recommendation model
- How to use neural networks for collaborative filtering
In the next post, we'll use the model we've trained to recommend movies to new users.