Movie Recommender System

- 15 mins

Recommender System/Engine

A recommender system, or recommendation system, is a subclass of information filtering systems that seeks to predict the “rating” or “preference” a user would give to an item. Recommender systems use these predictions to suggest relevant items and to create a personalized experience for each user.

Types/Approaches

There are two main types of recommendation systems:

Collaborative filtering

Collaborative filtering methods are based on collecting and analyzing a large amount of information on users’ behaviors, activities or preferences, and predicting what users will like based on their similarity to other users. A key advantage of the collaborative filtering approach is that it does not rely on machine-analyzable content, so it can recommend complex items such as movies without requiring an “understanding” of the item itself.
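The idea of "similarity to other users" can be made concrete with a small sketch. The rating matrix below is made-up data and `cosine_similarity` is an illustrative helper, not something taken from the notebook; it only shows how two users who rate the same movies alike end up with a score close to 1.

```python
import numpy as np

# Toy rating matrix (hypothetical data): rows are users, columns are movies,
# and 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0
    [4.0, 5.0, 1.0, 0.0],   # user 1: tastes close to user 0
    [1.0, 0.0, 5.0, 4.0],   # user 2: roughly the opposite tastes
])


def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (1.0 = same direction)."""
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_01 = cosine_similarity(ratings[0], ratings[1])
sim_02 = cosine_similarity(ratings[0], ratings[2])
print("user 0 vs user 1: %.3f" % sim_01)
print("user 0 vs user 2: %.3f" % sim_02)
```

A user-based recommender would lean on user 1 rather than user 2 when predicting ratings for user 0.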

Content-based filtering

Content-based filtering methods are based on a description of the item and a profile of the user’s preferences. In a content-based recommender system, keywords are used to describe the items, and a user profile is built to indicate the type of item this user likes. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user, and the best-matching items are recommended.
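As a rough illustration of the description above, the sketch below treats the pipe-separated genres from movies.csv as the "keywords". The four-movie catalogue, the `liked` list and the helper names are assumptions made for the example; the notebook itself does not build such a profile.

```python
import numpy as np

# Hypothetical mini catalogue in the pipe-separated genre format of movies.csv
movies = {
    "Toy Story (1995)": "Adventure|Animation|Children|Comedy|Fantasy",
    "Jumanji (1995)": "Adventure|Children|Fantasy",
    "Heat (1995)": "Action|Crime|Thriller",
    "GoldenEye (1995)": "Action|Adventure|Thriller",
}

# One dimension per genre seen in the catalogue
genres = sorted({g for s in movies.values() for g in s.split("|")})


def genre_vector(genre_string):
    """Multi-hot encoding of a genre string over the `genres` vocabulary."""
    present = set(genre_string.split("|"))
    return np.array([1.0 if g in present else 0.0 for g in genres])


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a.dot(b) / denom) if denom else 0.0


item_vectors = {title: genre_vector(s) for title, s in movies.items()}

# User profile: the sum of the vectors of the movies the user liked (assumed)
liked = ["Toy Story (1995)"]
profile = sum(item_vectors[t] for t in liked)

# Score every movie the user has not seen against the profile
scores = {t: cosine(profile, v) for t, v in item_vectors.items() if t not in liked}
best = max(scores, key=scores.get)
print("best match: %s" % best)
```

Only item descriptions and the user's own history are used here; no other user's ratings are involved, which is what separates this approach from collaborative filtering.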

Data source:

https://grouplens.org/datasets/movielens/

The Notebook

The notebook for this work can be found at Recommender System IPython notebook

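Import libraries

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances

# Load the ratings and keep only the columns we need
df_rating = pd.read_json('movies_rating.json')
df_rating = df_rating[['userId','movieId','rating']]

# Load the movie metadata (movieId, title, genres)
df_mv = pd.read_csv('movies.csv')
print(df_rating.shape)

print(df_mv.shape)
(84730, 3)
(9125, 3)
# Reindex to 0..n-1: any label missing from the original index becomes a NaN row,
# and dropna() then removes every row that contains a NaN
new_index = range(len(df_rating))
df_rating = df_rating.reindex(new_index)
df_rating.dropna(inplace=True)
print(df_rating.shape)
(72129, 3)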
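The drop from 84730 to 72129 rows comes from the reindex/dropna pair. One plausible reading, assuming the index loaded from the JSON file has gaps, is sketched below on a hypothetical three-row frame; the real file may behave differently.

```python
import pandas as pd

# Hypothetical frame whose index labels (0, 2, 5) have gaps
df = pd.DataFrame(
    {"userId": [1, 1, 2], "rating": [2.5, 3.0, 4.0]},
    index=[0, 2, 5],
)

# Ask for labels 0, 1, 2: label 1 does not exist, so it becomes a NaN row,
# and label 5 is not requested, so it is discarded
reindexed = df.reindex(range(len(df)))
print(reindexed)

# dropna() removes the NaN row that reindex introduced
cleaned = reindexed.dropna()
print(cleaned.shape)
```

The NaN row also turns the integer column into floats, which is why a conversion back to int is needed afterwards.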

Convert userId and movieId to integers

df_rating.userId = df_rating.userId.astype(int)
df_rating.movieId = df_rating.movieId.astype(int)
print(df_rating.head(10))
   userId  movieId  rating
0       1       31     2.5
1       1     1029     3.0
2       1     1061     3.0
3       1     1129     2.0
4       1     1172     4.0
5       1     1263     2.0
6       1     1287     2.0
7       1     1293     2.0
8       1     1339     3.5
9       1     1343     2.0

Check the number of users and the number of movies

num_users = df_rating.userId.unique().shape[0]
num_movies = df_rating.movieId.unique().shape[0]
print('\nNumber of users = ' + str(num_users) + ' | Number of movies = ' + str(num_movies))
Number of users = 566 | Number of movies = 5411
max_movie_id = df_rating.movieId.max()
min_movie_id = df_rating.movieId.min()

max_user_id = df_rating.userId.max()
min_user_id  = df_rating.userId.min()
print("Max MovieID : %d | Min MovieID : %d " % (max_movie_id, min_movie_id))
print("Max UserID : %d | Min UserID : %d " % (max_user_id, min_user_id))
Max MovieID : 9000 | Min MovieID : 1 
Max UserID : 566 | Min UserID : 1 
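These counts also show how sparse the data is. The sketch below only reuses figures printed in the outputs above (72129 ratings, 566 users, 5411 rated movies, maximum movieId 9000); the `density` names are illustrative.

```python
# Figures copied from the outputs above
num_ratings = 72129
num_users = 566
num_movies = 5411      # distinct movies that received at least one rating
max_movie_id = 9000    # largest movieId among the ratings

# Share of (user, rated movie) pairs that actually carry a rating
density = num_ratings / float(num_users * num_movies)
print("density over rated movies: %.2f%%" % (100 * density))

# A matrix with one column per movieId up to the maximum is wider,
# so an even smaller share of its cells is filled
density_full = num_ratings / float(num_users * max_movie_id)
print("density over a %d x %d matrix: %.2f%%" % (num_users, max_movie_id, 100 * density_full))
```

In both cases only a few percent of the cells hold a rating, so most of any user/movie matrix built from this data will be zeros.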

Check the movie titles

df_mv
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
10 11 American President, The (1995) Comedy|Drama|Romance
11 12 Dracula: Dead and Loving It (1995) Comedy|Horror
12 13 Balto (1995) Adventure|Animation|Children
13 14 Nixon (1995) Drama
14 15 Cutthroat Island (1995) Action|Adventure|Romance
15 16 Casino (1995) Crime|Drama
16 17 Sense and Sensibility (1995) Drama|Romance
17 18 Four Rooms (1995) Comedy
18 19 Ace Ventura: When Nature Calls (1995) Comedy
19 20 Money Train (1995) Action|Comedy|Crime|Drama|Thriller
20 21 Get Shorty (1995) Comedy|Crime|Thriller
21 22 Copycat (1995) Crime|Drama|Horror|Mystery|Thriller
22 23 Assassins (1995) Action|Crime|Thriller
23 24 Powder (1995) Drama|Sci-Fi
24 25 Leaving Las Vegas (1995) Drama|Romance
25 26 Othello (1995) Drama
26 27 Now and Then (1995) Children|Drama
27 28 Persuasion (1995) Drama|Romance
28 29 City of Lost Children, The (Cité des enfants p... Adventure|Drama|Fantasy|Mystery|Sci-Fi
29 30 Shanghai Triad (Yao a yao yao dao waipo qiao) ... Crime|Drama
... ... ... ...
9095 159690 Teenage Mutant Ninja Turtles: Out of the Shado... Action|Adventure|Comedy
9096 159755 Popstar: Never Stop Never Stopping (2016) Comedy
9097 159858 The Conjuring 2 (2016) Horror
9098 159972 Approaching the Unknown (2016) Drama|Sci-Fi|Thriller
9099 160080 Ghostbusters (2016) Action|Comedy|Horror|Sci-Fi
9100 160271 Central Intelligence (2016) Action|Comedy
9101 160438 Jason Bourne (2016) Action
9102 160440 The Maid's Room (2014) Thriller
9103 160563 The Legend of Tarzan (2016) Action|Adventure
9104 160565 The Purge: Election Year (2016) Action|Horror|Sci-Fi
9105 160567 Mike & Dave Need Wedding Dates (2016) Comedy
9106 160590 Survive and Advance (2013) (no genres listed)
9107 160656 Tallulah (2016) Drama
9108 160718 Piper (2016) Animation
9109 160954 Nerve (2016) Drama|Thriller
9110 161084 My Friend Rockefeller (2015) Documentary
9111 161155 Sunspring (2016) Sci-Fi
9112 161336 Author: The JT LeRoy Story (2016) Documentary
9113 161582 Hell or High Water (2016) Crime|Drama
9114 161594 Kingsglaive: Final Fantasy XV (2016) Action|Adventure|Animation|Drama|Fantasy|Sci-Fi
9115 161830 Body (2015) Drama|Horror|Thriller
9116 161918 Sharknado 4: The 4th Awakens (2016) Action|Adventure|Horror|Sci-Fi
9117 161944 The Last Brickmaker in America (2001) Drama
9118 162376 Stranger Things Drama
9119 162542 Rustom (2016) Romance|Thriller
9120 162672 Mohenjo Daro (2016) Adventure|Drama|Romance
9121 163056 Shin Godzilla (2016) Action|Adventure|Fantasy|Sci-Fi
9122 163949 The Beatles: Eight Days a Week - The Touring Y... Documentary
9123 164977 The Gay Desperado (1936) Comedy
9124 164979 Women of '69, Unboxed Documentary

9125 rows × 3 columns

Check for unique movie titles

len(df_mv.title.unique())
9123
len(df_mv.movieId)
9125
# Map each movieId to its title. movieId values are not contiguous,
# so the lookup goes by id rather than by row position.
id_2_movie = {}
for movie_id, title in zip(df_mv.movieId, df_mv.title):
    if movie_id > max_movie_id:
        break
    id_2_movie[movie_id] = title
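The movie table has 9125 rows but movieId values that run up to 164979, so a row position and a movieId are not interchangeable. The sketch below shows the difference on a hypothetical four-row excerpt; `df_mv_demo` and `id_to_title` are illustrative names.

```python
import pandas as pd

# Hypothetical excerpt of movies.csv; note the jump from movieId 3 to 31
df_mv_demo = pd.DataFrame({
    "movieId": [1, 2, 3, 31],
    "title": [
        "Toy Story (1995)",
        "Jumanji (1995)",
        "Grumpier Old Men (1995)",
        "Dangerous Minds (1995)",
    ],
})

# Lookup keyed by movieId: correct regardless of gaps in the ids
id_to_title = dict(zip(df_mv_demo.movieId.tolist(), df_mv_demo.title.tolist()))
print(id_to_title[31])

# Lookup by row position: position 3 happens to hold movieId 31,
# and position 30 (movieId 31 - 1) does not exist in this excerpt at all
print(df_mv_demo.iloc[3].title)
print(id_to_title.get(4, "no such movieId"))
```

Using `.get` with a default is one way to cope with ids that appear in the ratings but not in the movie table.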

Create user-movies similarity matrices

cnt = 0
for line in df_rating.itertuples():
    print(line)
    cnt +=1
    if cnt > 20:
        break
Pandas(Index=0, userId=1, movieId=31, rating=2.5)
Pandas(Index=1, userId=1, movieId=1029, rating=3.0)
Pandas(Index=2, userId=1, movieId=1061, rating=3.0)
Pandas(Index=3, userId=1, movieId=1129, rating=2.0)
Pandas(Index=4, userId=1, movieId=1172, rating=4.0)
Pandas(Index=5, userId=1, movieId=1263, rating=2.0)
Pandas(Index=6, userId=1, movieId=1287, rating=2.0)
Pandas(Index=7, userId=1, movieId=1293, rating=2.0)
Pandas(Index=8, userId=1, movieId=1339, rating=3.5)
Pandas(Index=9, userId=1, movieId=1343, rating=2.0)
Pandas(Index=10, userId=1, movieId=1371, rating=2.5)
Pandas(Index=11, userId=1, movieId=1405, rating=1.0)
Pandas(Index=12, userId=1, movieId=1953, rating=4.0)
Pandas(Index=13, userId=1, movieId=2105, rating=4.0)
Pandas(Index=14, userId=1, movieId=2150, rating=3.0)
Pandas(Index=15, userId=1, movieId=2193, rating=2.0)
Pandas(Index=16, userId=1, movieId=2294, rating=2.0)
Pandas(Index=17, userId=1, movieId=2455, rating=2.5)
Pandas(Index=18, userId=1, movieId=2968, rating=1.0)
Pandas(Index=19, userId=1, movieId=3671, rating=3.0)
Pandas(Index=20, userId=2, movieId=10, rating=4.0)
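# Build the user x movie rating matrix: row = userId - 1, column = movieId - 1, 0 = not rated
df_matrix = np.zeros((num_users, max_movie_id))
for line in df_rating.itertuples():
    df_matrix[line[1]-1, line[2]-1] = line[3]
# Note: pairwise_distances returns the cosine *distance* (1 - cosine similarity),
# so 0 means identical rating vectors and 1 means no ratings in common
user_similarity = pairwise_distances(df_matrix, metric='cosine')
movie_similarity = pairwise_distances(df_matrix.T, metric='cosine')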
user_similarity[:5]
array([[0.        , 1.        , 1.        , ..., 0.91832729, 1.        ,
        1.        ],
       [1.        , 0.        , 0.8665161 , ..., 0.82637126, 0.90441205,
        1.        ],
       [1.        , 0.8665161 , 0.        , ..., 0.87694711, 0.92983349,
        0.91836046],
       [0.92551755, 0.88117897, 0.91232459, ..., 0.69537101, 0.9291931 ,
        1.        ],
       [0.98197165, 0.88889461, 0.82555476, ..., 0.88217131, 0.96200923,
        1.        ]])
movie_similarity[:5]
array([[0.        , 0.59828933, 0.7080403 , ..., 0.87482476, 0.9309266 ,
        0.9309266 ],
       [0.59828933, 0.        , 0.80294811, ..., 0.95590021, 1.        ,
        1.        ],
       [0.7080403 , 0.80294811, 0.        , ..., 1.        , 1.        ,
        1.        ],
       [0.86907562, 0.81043032, 0.80491616, ..., 1.        , 1.        ,
        1.        ],
       [0.74970641, 0.74490917, 0.64807423, ..., 0.86215254, 0.85079341,
        0.85079341]])
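When reading these arrays, it helps to remember that `pairwise_distances` with `metric='cosine'` returns a cosine distance, which is 1 minus the cosine similarity; that is why the diagonal is 0. The sketch below computes both quantities with plain NumPy on made-up ratings, so the relationship is visible without scikit-learn. `cosine_similarity_matrix` is an illustrative helper, not part of the notebook.

```python
import numpy as np


def cosine_similarity_matrix(m):
    """Row-by-row cosine similarity; all-zero rows get similarity 0 to everything."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for empty rows
    unit = m / norms
    return unit.dot(unit.T)


# Made-up ratings: users 0 and 1 agree, user 2 rated a different movie
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 3.0],
])

similarity = cosine_similarity_matrix(ratings)
distance = 1.0 - similarity          # the quantity a cosine distance reports

print(np.round(similarity, 3))
print(np.round(distance, 3))
```

A small distance means "similar", so any ranking of neighbours has to sort distances in ascending order, or sort similarities in descending order.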
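# Top 10 users for user id 200 (row 199), ranked by cosine distance.
# Note: the values are distances, so descending order lists the *least* similar users;
# pass ascending=True to list the most similar ones instead.
user_dist = pd.DataFrame(user_similarity).loc[199]
print("users similar to user id 200: \n" + str(user_dist[user_dist > 0].sort_values(ascending=False)[0:10]))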

users similar to user id 200: 
565    1.0
548    1.0
34     1.0
70     1.0
75     1.0
134    1.0
157    1.0
275    1.0
309    1.0
324    1.0
Name: 199, dtype: float64
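# Top 5 movies for row 9 of the matrix (movieId 10, since rows are 0-based), ranked by cosine distance.
# Note: as with users, descending order on a distance lists the *least* similar movies;
# pass ascending=True to list the most similar ones instead.
movie_dist = pd.DataFrame(movie_similarity).loc[9]
print("movie similar to movie id 9: \n" + str(movie_dist[movie_dist > 0].sort_values(ascending=False)[0:5]))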

movie similar to movie id 9: 
4500    1.0
5085    1.0
5072    1.0
5075    1.0
5076    1.0
Name: 9, dtype: float64
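mv_dist = pd.DataFrame(movie_similarity).loc[9]
mv = mv_dist[mv_dist > 0].sort_values(ascending=False)[0:5]
# Note: rows and columns are 0-based, so row 9 is movieId 10 and column i is movieId i + 1;
# the lookups below use movieId 31 for the header and the raw positions for the list
print("\t\t\t movie |  %s  | is similar to " % id_2_movie[31])
for i in mv.index:
    print(id_2_movie[i])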
       movie |  Dangerous Minds (1995)  | is similar to 
Man Who Fell to Earth, The (1976)
Miracle (2004)
Boy and His Dog, A (1975)
Tormented (1960)
Chitty Chitty Bang Bang (1968)
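Picking neighbours from a distance matrix comes down to three steps: sort the row in ascending order, skip the item itself, and convert 0-based positions back to ids. The sketch below does this on a made-up 4 x 4 distance matrix; `top_k_similar` is a hypothetical helper, and the `+ 1` shift assumes columns were filled as movieId - 1.

```python
import numpy as np


def top_k_similar(distance_matrix, row, k):
    """Positions of the k rows closest to `row`, nearest first, excluding `row` itself."""
    order = np.argsort(distance_matrix[row])   # ascending: smallest distance first
    order = order[order != row]                # drop the item itself (distance 0)
    return order[:k]


# Made-up symmetric cosine distances between four movies
distance = np.array([
    [0.0, 0.2, 0.9, 0.5],
    [0.2, 0.0, 0.8, 0.6],
    [0.9, 0.8, 0.0, 0.3],
    [0.5, 0.6, 0.3, 0.0],
])

neighbours = top_k_similar(distance, row=0, k=2)

# Column i was filled from movieId i + 1, so shift back before looking up titles
movie_ids = [int(i) + 1 for i in neighbours]
print(movie_ids)
```

The same function works for users by passing the user-to-user distance matrix instead.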

Building the Recommendation engine

# Item-based rating prediction: a weighted average of each user's ratings,
# with the weights taken from the movie-to-movie matrix
def predict_movie_based(rating_matrix, similarity_matrix):
    return rating_matrix.dot(similarity_matrix) / np.array([np.abs(similarity_matrix).sum(axis=1)])


# User-based rating prediction: each user's mean rating plus the weighted
# average of the other users' deviations from their own means.
# Note: the mean is taken over all columns, including the zeros that stand for unrated movies.
def predict_user_based(rating_matrix, similarity_matrix):
    mean_user_rating = rating_matrix.mean(axis=1)
    ratings_diff = (rating_matrix - mean_user_rating[:, np.newaxis])
    return mean_user_rating[:, np.newaxis] + similarity_matrix.dot(ratings_diff) / np.array([np.abs(similarity_matrix).sum(axis=1)]).T


# Keep the results under names that differ from the functions that produce them
movie_based_prediction = predict_movie_based(df_matrix, movie_similarity)
user_based_prediction = predict_user_based(df_matrix, user_similarity)
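To see what the user-based formula does, it can be run on a tiny matrix where the answer is easy to check by hand. The sketch below restates the formula under the hypothetical name `user_based_demo` and feeds it made-up similarity weights (not the distances computed earlier), so it illustrates the formula rather than the notebook's exact numbers.

```python
import numpy as np


def user_based_demo(rating_matrix, similarity):
    """Mean rating per user plus the similarity-weighted average of rating deviations."""
    mean_user_rating = rating_matrix.mean(axis=1, keepdims=True)
    ratings_diff = rating_matrix - mean_user_rating
    weights = np.abs(similarity).sum(axis=1, keepdims=True)
    return mean_user_rating + similarity.dot(ratings_diff) / weights


# Two users with opposite opinions about two movies (made-up data)
ratings = np.array([
    [4.0, 2.0],
    [2.0, 4.0],
])

# Case 1: each user is only similar to themselves, so predictions equal the ratings
only_self = np.eye(2)
pred_self = user_based_demo(ratings, only_self)
print(pred_self)

# Case 2: both users weigh equally, so the opposite deviations cancel
# and every prediction falls back to the user's mean rating of 3.0
equal_weights = np.ones((2, 2))
pred_equal = user_based_demo(ratings, equal_weights)
print(pred_equal)
```

The two cases bracket the behaviour: the more weight a neighbour gets, the further the prediction moves from the user's own mean towards that neighbour's deviations.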

Recommendation using user-based collaborative filtering

user_based = pd.DataFrame(user_based_prediction)
# Row 5 of the matrix (userId 6, since rows are 0-based); keep only the movies this user has not rated
predictions = user_based.loc[5,pd.DataFrame(df_matrix).loc[5,:] == 0]
top = predictions.sort_values(ascending=False).head(n=5)
recommendations = pd.DataFrame(data=top)
recommendations.columns = ['Predicted Rating']
print("Prediction for Movies That User < #5 > has not rated yet! \n" + str(recommendations))

Prediction for Movies That User < #5 > has not rated yet! 
     Predicted Rating
317          2.013651
355          2.007654
295          1.945572
592          1.783942
259          1.771092
rec = [id_2_movie[i] for i in recommendations.index]
rec = pd.DataFrame(rec,columns=['Recommendation'])
print("Top Five Movies Recommendation for user < #5 > \n" + str(rec))
Top Five Movies Recommendation for user < #5 > 
                 Recommendation
0  American Rhapsody, An (2001)
1               Red Heat (1988)
2         Ghosts of Mars (2001)
3                  Lucas (1986)
4           Perfect Blue (1997)
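The selection step (keep unrated movies, sort by predicted rating, take the top N) can be wrapped in a small helper. The sketch below is a NumPy-only version on made-up numbers; `recommend` is a hypothetical function, and the `+ 1` shift assumes columns were filled as movieId - 1.

```python
import numpy as np


def recommend(predictions, ratings, user_row, n):
    """Return up to n (movieId, predicted rating) pairs for movies the user has not rated."""
    scores = predictions[user_row].astype(float).copy()
    scores[ratings[user_row] > 0] = -np.inf      # never recommend an already-rated movie
    top = np.argsort(scores)[::-1][:n]           # highest predicted rating first
    # Column i holds movieId i + 1; skip masked entries if fewer than n remain
    return [(int(col) + 1, float(scores[col])) for col in top if np.isfinite(scores[col])]


# Made-up data for a single user: movies 1 and 4 are already rated
ratings = np.array([[5.0, 0.0, 0.0, 3.0]])
predictions = np.array([[4.8, 2.1, 3.9, 3.0]])

top_two = recommend(predictions, ratings, user_row=0, n=2)
print(top_two)
```

Returning movie ids rather than matrix positions keeps the title lookup that follows free of off-by-one surprises.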

Recommendation using item-based collaborative filtering

movie_based = pd.DataFrame(movie_based_prediction)
# Same user row as before; keep only the movies this user has not rated
predictions = movie_based.loc[5,pd.DataFrame(df_matrix).loc[5,:] == 0]
top = predictions.sort_values(ascending=False).head(n=5)
recommendations = pd.DataFrame(data=top)
recommendations.columns = ['Predicted Rating']
print("Prediction for Movies That User < #5 > has not rated yet! \n" + str(recommendations))


Prediction for Movies That User < #5 > has not rated yet! 
      Predicted Rating
3715          0.016042
3571          0.016033
3713          0.016032
2780          0.016027
2255          0.016020
rec = [id_2_movie[i] for i in recommendations.index]
rec = pd.DataFrame(rec,columns=['Recommendation'])
print("Top Five Movies Recommendation for user < #5 > \n" + str(rec))
Top Five Movies Recommendation for user < #5 > 
                 Recommendation
0  American Rhapsody, An (2001)
1               Red Heat (1988)
2         Ghosts of Mars (2001)
3                  Lucas (1986)
4           Perfect Blue (1997)

Mustapha Omotosho

Constant learner, machine learning enthusiast, huge Barcelona fan
