Building a Movie Recommender - Part 2

This is the second post in a three-part series:

  1. Part 1 - we trained a model capable of recommending movies to users
  2. Part 2 (this post) - we will build a system to recommend movies to new users
  3. Part 3 - we will create an API and UI for our system and deploy them

Recap of Data and Model

Let us briefly revisit the dataset we are using.

from fastai.collab import *
from fastai.tabular.all import *

path = untar_data(URLs.ML_100k)

ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])

ratings.head()
user movie rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596

Each row in this table consists of a rating that a user gave to a movie.

We can ignore the timestamp field for now since we didn't use it in our model.

Let us also take a quick look at the model we trained in the previous post.

learn = load_learner('movie-recommender-all-data.pkl')

learn.model
EmbeddingDotBias(
  (u_weight): Embedding(944, 50)
  (i_weight): Embedding(1665, 50)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)

The model we trained has the following four components:

  1. u_weight - a 50-dimensional vector for each of the 944 user entries (the 943 users in the dataset plus fastai's #na# placeholder at index 0).
  2. u_bias - a bias value for each of the 944 user entries.
  3. i_weight - a 50-dimensional vector for each of the 1665 movie entries (which likewise include the #na# placeholder).
  4. i_bias - a bias value for each of the 1665 movie entries.
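
To make the role of these four components concrete, here is a minimal NumPy sketch of how a dot-product-with-bias model scores a single (user, movie) pair. The arrays are random stand-ins with the same shapes as the trained embeddings, not the real weights, and `raw_score` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins with the same shapes as the trained embeddings
u_weight = rng.normal(scale=0.1, size=(944, 50))
i_weight = rng.normal(scale=0.1, size=(1665, 50))
u_bias = rng.normal(scale=0.1, size=(944, 1))
i_bias = rng.normal(scale=0.1, size=(1665, 1))

def raw_score(user_idx, movie_idx):
    """Dot product of the two vectors plus the user bias and the movie bias."""
    dot = u_weight[user_idx] @ i_weight[movie_idx]
    return float(dot + u_bias[user_idx, 0] + i_bias[movie_idx, 0])

print(raw_score(user_idx=196, movie_idx=242))
```

The vectors capture how well a user's taste lines up with a movie, while the biases capture how generous a user is overall and how well liked a movie is overall.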

The Bootstrapping Problem

Collaborative filtering systems like the movie recommender we are trying to create face the bootstrapping problem (often called the cold-start problem) when one of two things occurs:

  1. A new user signs up - which movies do we recommend to such a user?
  2. A new movie is added - to which users do we recommend the movie?

Signup Metadata

Many sites ask new users a set of questions when they sign up. A separate model can be trained on existing users to predict a user's embedding vector from this signup metadata. The predicted vector can then be used to estimate the ratings the new user would give to movies.
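
As a rough illustration of this idea, the sketch below fits a plain least-squares linear map from signup answers to embedding vectors. Everything in it is an assumption made for the example: the eight binary signup features are hypothetical, the "trained" user vectors are random stand-ins, and a real system would likely use a richer model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_features, n_factors = 943, 8, 50

# Hypothetical signup answers (e.g. one-hot genre preferences), one row per existing user
signup_features = rng.integers(0, 2, size=(n_users, n_features)).astype(float)
# Random stand-in for the trained user embedding vectors
user_vectors = rng.normal(scale=0.1, size=(n_users, n_factors))

# Add a constant column so the linear map can learn an offset
X = np.hstack([signup_features, np.ones((n_users, 1))])

# Least-squares fit of a linear map: signup features -> embedding vector
W, *_ = np.linalg.lstsq(X, user_vectors, rcond=None)

# Predict an embedding for a new user (the last entry is the constant term)
new_user_features = np.array([[1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]])
new_user_vector = new_user_features @ W
print(new_user_vector.shape)
```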

Average of All User Embeddings

Another option would be to use an average of all the existing user embedding vectors as the new user's embedding vector.

However, this would probably give terrible recommendations in the beginning as it is unlikely that such an average would be close to representing any user in the real world.

Pick "Average User"

It is also possible to pick an existing user to represent the "average user" and provide recommendations based on this user's embedding vector. This works only if you have a way to pick such an average user. Even then, not all new users' tastes would match those of this average user.
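
The last two options can be sketched in a few lines of NumPy. The embedding matrix here is a random stand-in for the trained one, and choosing the user nearest the mean is just one hypothetical way of picking an "average user".

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for the trained user embeddings; row 0 plays the role of #na#
u_weight = rng.normal(scale=0.1, size=(944, 50))
real_users = u_weight[1:]  # drop the placeholder row

# Option 1: the mean of all user vectors
mean_vector = real_users.mean(axis=0)

# Option 2: the existing user whose vector lies closest to that mean
distances = np.linalg.norm(real_users - mean_vector, axis=1)
average_user_idx = int(np.argmin(distances)) + 1  # +1 restores the embedding index
average_user_vector = u_weight[average_user_idx]

print(average_user_idx, float(distances.min()))
```

The printed distance gives a feel for how far even the closest real user is from the mean vector.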

Recommending Movies to a New User

We will use a combination of some of the above approaches. We will follow a five-step process to recommend movies to a new user:

  1. Ask the user to rate movies they have already seen
  2. Use the user's ratings to find top 5 similar users
  3. Assign the new user a vector that is the mean of these 5 users' vectors
  4. Use this calculated vector to predict how the new user would rate other movies
  5. Return the top 5 movies with the highest predicted rating as recommendations

The more movies a user rates beforehand, the better the predictions will be. This wouldn't be ideal for a platform like Netflix, where new users want to start watching right away rather than spend time rating movies they have already seen. But for this toy project, the approach should be fine.
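
Before walking through each step with the real model, here is the whole five-step process condensed into one function on toy data. It is only a sketch: the `recommend` helper is hypothetical, all arrays are random stand-ins, and it assumes the columns of the ratings matrix line up one-to-one with the rows of the movie embeddings.

```python
import numpy as np

def recommend(new_user_ratings, ratings_matrix, u_weight, u_bias, i_weight, i_bias,
              n_similar=5, n_recs=5):
    """Return the indices of the n_recs highest-scoring movies for a new user."""
    # Step 2: cosine similarity between the new user and every existing user
    norms = np.linalg.norm(ratings_matrix, axis=1) * np.linalg.norm(new_user_ratings)
    sims = ratings_matrix @ new_user_ratings / np.maximum(norms, 1e-8)
    similar = np.argsort(sims)[::-1][:n_similar]
    # Step 3: average the vectors (and biases) of the most similar users
    vector = u_weight[similar].mean(axis=0)
    bias = u_bias[similar].mean()
    # Step 4: score every movie with the averaged vector and bias
    scores = i_weight @ vector + i_bias + bias
    # Step 5: keep the highest-scoring movies
    return np.argsort(scores)[::-1][:n_recs]

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 20, 30, 4
ratings_matrix = rng.integers(0, 6, size=(n_users, n_movies)).astype(float)

# Step 1: the new user's ratings (0 means "not rated")
new_user_ratings = np.zeros(n_movies)
new_user_ratings[[2, 5, 7]] = [5, 1, 4]

recs = recommend(new_user_ratings, ratings_matrix,
                 u_weight=rng.normal(size=(n_users, n_factors)),
                 u_bias=rng.normal(size=n_users),
                 i_weight=rng.normal(size=(n_movies, n_factors)),
                 i_bias=rng.normal(size=n_movies))
print(recs)
```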

Let us assume a new user has provided the following ratings:

# Each tuple represents a (movie, rating) combination
user_ratings = [(15, 1), (16, 1), (18, 5), (19, 4), (242, 3)]

We first add these ratings to our dataset. We use the ID 999 to represent the new user since this ID is not used in the original dataset.

user_ratings_df = pd.DataFrame(
    [{"user": 999, "movie": movie_id, "rating": rating}
     for movie_id, rating in user_ratings]
)

# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
new_ratings = pd.concat([ratings, user_ratings_df], ignore_index=True)
new_ratings.tail()
user movie rating timestamp
100000 999 15 1 NaN
100001 999 16 1 NaN
100002 999 18 5 NaN
100003 999 19 4 NaN
100004 999 242 3 NaN

The value of timestamp is not used so we can safely ignore the NaN values.

We can now cross-tabulate this data and see the entry created for our new user.

crosstab = pd.crosstab(
    new_ratings['user'], new_ratings['movie'],
    values=new_ratings['rating'], aggfunc='sum',
).fillna(0)
crosstab.tail()
movie 1 2 3 4 5 6 7 8 9 10 ... 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682
user
940 0.0 0.0 0.0 2.0 0.0 0.0 4.0 5.0 3.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
941 5.0 0.0 0.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
942 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
943 0.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
999 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1682 columns

Our aim now is to compare the new user (last row) with all other users. We can do this by computing the cosine similarity between the last row and each of the other rows.

other_users = crosstab.values[:-1]
new_user = crosstab.values[-1].reshape(1, -1)

similarities = nn.CosineSimilarity()(tensor(other_users), tensor(new_user))
similarities[:5]
tensor([0.1429, 0.1236, 0.0000, 0.0000, 0.0000])

A user that has given similar ratings to movies as the new user will have a higher cosine similarity value.
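
A tiny worked example may help build intuition for these values. The four-movie rating vectors below are made up for illustration, with 0 meaning "not rated", and `cosine_similarity` is a hypothetical helper written in NumPy rather than PyTorch.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

new_user      = np.array([1.0, 0.0, 2.0, 2.0])
same_taste    = np.array([2.0, 0.0, 4.0, 4.0])  # same pattern, higher ratings
no_overlap    = np.array([0.0, 3.0, 0.0, 0.0])  # rated only a movie the new user skipped
opposite_view = np.array([4.0, 0.0, 1.0, 1.0])  # same movies, reversed preferences

print(cosine_similarity(new_user, same_taste))     # approximately 1.0
print(cosine_similarity(new_user, no_overlap))     # 0.0
print(cosine_similarity(new_user, opposite_view))  # somewhere in between
```

Because unrated movies are filled with 0, an existing user who shares no rated movies with the new user gets a similarity of exactly 0, which is why several zeros appear in the output above.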

We now choose the top 5 users in our dataset that like (and dislike) the same movies as our new user.

top5 = similarities.topk(5)
top5
torch.return_types.topk(
values=tensor([0.2195, 0.2195, 0.1843, 0.1701, 0.1693]),
indices=tensor([305, 593, 130, 724,  13]))

Note: the similarity values are low because we've provided only a handful of ratings for the new user. The more ratings we provide, the more likely we are to find existing users in our dataset with similar taste.

We then calculate the mean of the vectors of these 5 users to generate a vector that represents our new user.

# Row i of the crosstab holds user ID i+1, and index 0 of the embedding is
# reserved for #na#, so we add 1 to get each user's embedding index
user_vectors = learn.u_weight.weight[1+top5.indices,:]

new_user_vector = user_vectors.mean(dim=0, keepdim=True)
new_user_vector
tensor([[ 0.0065, -0.1063, -0.1882, -0.1264, -0.1254, -0.1584, -0.0886,  0.0651,
         -0.0840, -0.1263, -0.0949,  0.1772, -0.0982, -0.0798,  0.1242,  0.0993,
          0.0164,  0.0501, -0.0850,  0.1100,  0.0925, -0.0747, -0.0946, -0.0541,
          0.0779,  0.0114,  0.0725,  0.1328, -0.1182, -0.0824, -0.1246, -0.0903,
          0.0949, -0.0944, -0.1412,  0.1149, -0.0805,  0.0979, -0.1231,  0.1039,
          0.0071,  0.1134,  0.1999,  0.0573, -0.0493,  0.1578, -0.0366,  0.0581,
         -0.0643,  0.0364]], grad_fn=<MeanBackward1>)

We can also use the mean value of the bias of these 5 users as the bias value for the new user.

user_biases = learn.u_bias.weight[1+top5.indices,:]
new_user_bias = user_biases.mean()
new_user_bias
tensor(0.1654, grad_fn=<MeanBackward0>)

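With values for the vector and bias of our new user, we can now score every movie in our dataset. Note that these are raw scores rather than ratings on the 1-5 scale: fastai's EmbeddingDotBias passes this sum through a scaled sigmoid when the model has a y_range. Because the sigmoid is monotonic, the ranking of movies is the same either way.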

pred_ratings = torch.matmul(new_user_vector, learn.i_weight.weight.T) + learn.i_bias.weight.T + new_user_bias
pred_ratings
tensor([[ 0.1632, -0.2196,  0.3719,  ..., -0.1531,  0.4604,  0.2233]],
       grad_fn=<AddBackward0>)
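
If we wanted values on the rating scale instead of raw scores, we could squash them with a scaled sigmoid. The sketch below applies one to the six scores visible in the output above. The range (0, 5.5) is an assumption for illustration and should be replaced by whatever y_range the model was actually trained with.

```python
import numpy as np

def sigmoid_range(x, low, high):
    """Squash raw scores into the open interval (low, high)."""
    return low + (high - low) / (1.0 + np.exp(-x))

# The six raw scores visible in the output above
raw_scores = np.array([0.1632, -0.2196, 0.3719, -0.1531, 0.4604, 0.2233])

# (0, 5.5) is an assumed y_range, not necessarily the one used in Part 1
scaled = sigmoid_range(raw_scores, low=0.0, high=5.5)
print(scaled.round(2))

# The ordering of the movies is unchanged by the transformation
print(np.array_equal(np.argsort(raw_scores), np.argsort(scaled)))
```

Since the ordering is preserved, picking the top 5 movies from the raw scores gives the same recommendations as picking them from the scaled values.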

We finally consider the top 5 movies with the highest predicted rating and extract their names from the model.

top5_ratings = pred_ratings.topk(5)
recommendations = learn.classes['title'][top5_ratings.indices.tolist()[0]]

for i, movie in enumerate(recommendations):
    print(f'{i+1}. {movie}')
1. Close Shave, A (1995)
2. Wrong Trousers, The (1993)
3. Schindler's List (1993)
4. Shawshank Redemption, The (1994)
5. Usual Suspects, The (1995)
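
One refinement worth considering is to exclude the #na# placeholder and any movies the user has already rated before taking the top 5, so that we never recommend something they have seen. The sketch below shows the masking step on random stand-in scores. It assumes the rated movies have already been mapped to the model's own indices, a mapping this post does not cover, so the `already_rated` values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in scores for a 1665-entry movie embedding (index 0 is #na#)
scores = rng.normal(size=1665)

# Hypothetical model indices of the movies the new user has already rated
already_rated = [15, 16, 18, 19, 242]

masked = scores.copy()
masked[0] = -np.inf              # never recommend the placeholder
masked[already_rated] = -np.inf  # skip movies the user has already seen

top5 = np.argsort(masked)[::-1][:5]
print(top5)
```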

Next steps

In this post, we looked at:

  • The bootstrapping problem in collaborative filtering models
  • Some ways of addressing the bootstrapping problem
  • One simple approach to recommend movies to new users

In the next post, we'll deploy our model to a cloud service and create a simple web application. Users will be able to rate movies they have seen and get recommendations using the approach described in this post.

© 2021 Ravi Suresh Mashru. All rights reserved.