Movie Recommender System
Recommender System/Engine
A recommender system, or recommendation system, is a subclass of information filtering system that seeks to predict the “rating” or “preference” a user would give to an item. Recommender systems use these predictions to suggest relevant items and give each user a personalized experience.
Types/Approaches
There are two main types of recommendation systems:
- content-based filtering
- collaborative filtering
Collaborative filtering
Collaborative filtering methods collect and analyze a large amount of information on users’ behaviors, activities or preferences, and predict what a user will like based on their similarity to other users. A key advantage of the collaborative filtering approach is that it does not rely on machine-analyzable content, so it can recommend complex items such as movies accurately without requiring an “understanding” of the item itself.
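As a rough illustration of "similarity to other users", the sketch below scores how alike two users are by the cosine of their rating vectors. The toy ratings and the helper name `cosine_similarity` are assumptions made for this example and are not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no ratings is treated as similar to nobody
    return dot / (norm_a * norm_b)

# Toy data: one list per user, one position per movie, 0 = not rated
toy_ratings = {
    "alice": [5.0, 4.0, 0.0, 1.0],
    "bob":   [4.0, 5.0, 0.0, 2.0],
    "carol": [1.0, 0.0, 5.0, 4.0],
}

sim_alice_bob = cosine_similarity(toy_ratings["alice"], toy_ratings["bob"])
sim_alice_carol = cosine_similarity(toy_ratings["alice"], toy_ratings["carol"])

# Alice and Bob rate the same movies highly, so they score as more alike
print(sim_alice_bob > sim_alice_carol)
```

A user-based recommender would then lean on Bob's ratings more than Carol's when guessing what Alice might like.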
Content-based filtering
Content-based filtering methods are based on a description of the item and a profile of the user’s preferences. In a content-based recommender system, keywords describe the items, and a user profile records the type of item that user likes. In other words, these algorithms try to recommend items similar to those a user liked in the past (or is examining in the present). Candidate items are compared with items the user has previously rated, and the best-matching items are recommended.
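A minimal sketch of that idea, assuming the pipe-separated `genres` strings from the MovieLens `movies.csv` act as the item keywords. Jaccard overlap is swapped in here as a simple similarity measure; it is not something the notebook itself uses, and the helper names are hypothetical.

```python
def genre_set(genres):
    """Split a pipe-separated genres string into a set of keywords."""
    return set(genres.split("|"))

def jaccard(a, b):
    """Share of keywords two items have in common (0.0 to 1.0)."""
    union = a | b
    return float(len(a & b)) / len(union) if union else 0.0

# Three rows taken from movies.csv
movies = {
    "Toy Story (1995)": "Adventure|Animation|Children|Comedy|Fantasy",
    "Jumanji (1995)": "Adventure|Children|Fantasy",
    "Heat (1995)": "Action|Crime|Thriller",
}

# Pretend the user liked one movie and score every other candidate against it
liked = "Toy Story (1995)"
liked_genres = genre_set(movies[liked])
scores = {
    title: jaccard(liked_genres, genre_set(genres))
    for title, genres in movies.items()
    if title != liked
}
best = max(scores, key=scores.get)
print(best)
```

Note that no ratings from other users are involved here; only the item descriptions and one user's history are needed.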
Data source:
https://grouplens.org/datasets/movielens/
The Notebook
The notebook for this work can be found at Recommender System IPython notebook
import libraries
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances
df_rating = pd.read_json('movies_rating.json')
df_rating = df_rating[['userId','movieId','rating']]
df_mv = pd.read_csv('movies.csv')
print df_rating.shape
print df_mv.shape
(84730, 3)
(9125, 3)
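# Reindex to a contiguous 0..n-1 range; labels absent from the original index become all-NaN rows
new_index = range(len(df_rating))
df_rating = df_rating.reindex(new_index)
# Drop those NaN rows so only real ratings remain
df_rating.dropna(inplace=True)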
print df_rating.shape
(72129, 3)
convert userID and movieID to integer
df_rating.userId = df_rating.userId.astype(int)
df_rating.movieId = df_rating.movieId.astype(int)
print df_rating.head(10)
userId movieId rating
0 1 31 2.5
1 1 1029 3.0
2 1 1061 3.0
3 1 1129 2.0
4 1 1172 4.0
5 1 1263 2.0
6 1 1287 2.0
7 1 1293 2.0
8 1 1339 3.5
9 1 1343 2.0
check the number of users and the number of movies
num_users = df_rating.userId.unique().shape[0]
num_movies = df_rating.movieId.unique().shape[0]
print '\nNumber of users = ' + str(num_users) + ' | Number of movies = ' + str(num_movies)
Number of users = 566 | Number of movies = 5411
max_movie_id = df_rating.movieId.max()
min_movie_id = df_rating.movieId.min()
max_user_id = df_rating.userId.max()
min_user_id = df_rating.userId.min()
print "Max MovieID : %d | Min MovieID : %d " %(max_movie_id,min_movie_id)
print "Max UserID : %d | Min UserID : %d " %(max_user_id,min_user_id)
Max MovieID : 9000 | Min MovieID : 1
Max UserID : 566 | Min UserID : 1
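These counts already hint at how sparse the data is. The arithmetic below is a back-of-the-envelope sketch that assumes every remaining rating fills a distinct cell of a `num_users` x `max_movie_id` grid (the shape used for the rating matrix later in this post). The names `num_ratings`, `total_cells` and `density` are introduced here for illustration.

```python
# Figures reported by the notebook output above
num_ratings = 72129      # rows left after dropna
num_users = 566
max_movie_id = 9000

# Assumption: each rating occupies its own (user, movie) cell
total_cells = num_users * max_movie_id
density = float(num_ratings) / total_cells

print("cells: %d" % total_cells)
print("filled: %.2f%%" % (100 * density))
```

Under that assumption only about 1.4% of the cells hold a rating, so many pairs of users are likely to share few or no rated movies.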
inspect the movie titles
df_mv
 | movieId | title | genres |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
5 | 6 | Heat (1995) | Action|Crime|Thriller |
6 | 7 | Sabrina (1995) | Comedy|Romance |
7 | 8 | Tom and Huck (1995) | Adventure|Children |
8 | 9 | Sudden Death (1995) | Action |
9 | 10 | GoldenEye (1995) | Action|Adventure|Thriller |
10 | 11 | American President, The (1995) | Comedy|Drama|Romance |
11 | 12 | Dracula: Dead and Loving It (1995) | Comedy|Horror |
12 | 13 | Balto (1995) | Adventure|Animation|Children |
13 | 14 | Nixon (1995) | Drama |
14 | 15 | Cutthroat Island (1995) | Action|Adventure|Romance |
15 | 16 | Casino (1995) | Crime|Drama |
16 | 17 | Sense and Sensibility (1995) | Drama|Romance |
17 | 18 | Four Rooms (1995) | Comedy |
18 | 19 | Ace Ventura: When Nature Calls (1995) | Comedy |
19 | 20 | Money Train (1995) | Action|Comedy|Crime|Drama|Thriller |
20 | 21 | Get Shorty (1995) | Comedy|Crime|Thriller |
21 | 22 | Copycat (1995) | Crime|Drama|Horror|Mystery|Thriller |
22 | 23 | Assassins (1995) | Action|Crime|Thriller |
23 | 24 | Powder (1995) | Drama|Sci-Fi |
24 | 25 | Leaving Las Vegas (1995) | Drama|Romance |
25 | 26 | Othello (1995) | Drama |
26 | 27 | Now and Then (1995) | Children|Drama |
27 | 28 | Persuasion (1995) | Drama|Romance |
28 | 29 | City of Lost Children, The (Cité des enfants p... | Adventure|Drama|Fantasy|Mystery|Sci-Fi |
29 | 30 | Shanghai Triad (Yao a yao yao dao waipo qiao) ... | Crime|Drama |
... | ... | ... | ... |
9095 | 159690 | Teenage Mutant Ninja Turtles: Out of the Shado... | Action|Adventure|Comedy |
9096 | 159755 | Popstar: Never Stop Never Stopping (2016) | Comedy |
9097 | 159858 | The Conjuring 2 (2016) | Horror |
9098 | 159972 | Approaching the Unknown (2016) | Drama|Sci-Fi|Thriller |
9099 | 160080 | Ghostbusters (2016) | Action|Comedy|Horror|Sci-Fi |
9100 | 160271 | Central Intelligence (2016) | Action|Comedy |
9101 | 160438 | Jason Bourne (2016) | Action |
9102 | 160440 | The Maid's Room (2014) | Thriller |
9103 | 160563 | The Legend of Tarzan (2016) | Action|Adventure |
9104 | 160565 | The Purge: Election Year (2016) | Action|Horror|Sci-Fi |
9105 | 160567 | Mike & Dave Need Wedding Dates (2016) | Comedy |
9106 | 160590 | Survive and Advance (2013) | (no genres listed) |
9107 | 160656 | Tallulah (2016) | Drama |
9108 | 160718 | Piper (2016) | Animation |
9109 | 160954 | Nerve (2016) | Drama|Thriller |
9110 | 161084 | My Friend Rockefeller (2015) | Documentary |
9111 | 161155 | Sunspring (2016) | Sci-Fi |
9112 | 161336 | Author: The JT LeRoy Story (2016) | Documentary |
9113 | 161582 | Hell or High Water (2016) | Crime|Drama |
9114 | 161594 | Kingsglaive: Final Fantasy XV (2016) | Action|Adventure|Animation|Drama|Fantasy|Sci-Fi |
9115 | 161830 | Body (2015) | Drama|Horror|Thriller |
9116 | 161918 | Sharknado 4: The 4th Awakens (2016) | Action|Adventure|Horror|Sci-Fi |
9117 | 161944 | The Last Brickmaker in America (2001) | Drama |
9118 | 162376 | Stranger Things | Drama |
9119 | 162542 | Rustom (2016) | Romance|Thriller |
9120 | 162672 | Mohenjo Daro (2016) | Adventure|Drama|Romance |
9121 | 163056 | Shin Godzilla (2016) | Action|Adventure|Fantasy|Sci-Fi |
9122 | 163949 | The Beatles: Eight Days a Week - The Touring Y... | Documentary |
9123 | 164977 | The Gay Desperado (1936) | Comedy |
9124 | 164979 | Women of '69, Unboxed | Documentary |
9125 rows × 3 columns
check the number of unique movie titles
len(df_mv.title.unique())
9123
len(df_mv.movieId)
9125
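# Map each movie id to a title.
# NOTE: the title is looked up by row position (movie_id - 1), which matches
# movieId only while the ids in movies.csv are contiguous.
id_2_movie = {}
for movie_id in df_mv.movieId:
    index = df_mv.index[movie_id-1]
    if index > max_movie_id:
        break
    id_2_movie[movie_id] = df_mv.loc[index].title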
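The loop above finds each title by row position, and the MovieLens ids have gaps (the table above jumps from row 9124 to movieId 164979). As a hedged alternative, the sketch below keys the lookup directly on movieId. Plain lists stand in for the `df_mv.movieId` and `df_mv.title` columns, and the names `id_to_title` and `title_for_column` are hypothetical.

```python
# Stand-ins for the df_mv.movieId and df_mv.title columns (note the gap in ids)
movie_ids = [1, 2, 160080]
titles = ["Toy Story (1995)", "Jumanji (1995)", "Ghostbusters (2016)"]

# With the DataFrame this would be: dict(zip(df_mv.movieId, df_mv.title))
id_to_title = dict(zip(movie_ids, titles))

def title_for_column(column):
    """Title for a rating-matrix column, where column = movieId - 1."""
    movie_id = column + 1
    return id_to_title.get(movie_id, "unknown movieId %d" % movie_id)

print(id_to_title[160080])
print(title_for_column(0))
```

Because the rating matrix built later stores movieId `m` in column `m - 1`, any column index coming out of that matrix needs the `+ 1` shown in `title_for_column` before it is used as a key.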
Create the user-movie rating matrix and the similarity matrices
cnt = 0
for line in df_rating.itertuples():
    print line
    cnt +=1
    if cnt > 20:
        break
Pandas(Index=0, userId=1, movieId=31, rating=2.5)
Pandas(Index=1, userId=1, movieId=1029, rating=3.0)
Pandas(Index=2, userId=1, movieId=1061, rating=3.0)
Pandas(Index=3, userId=1, movieId=1129, rating=2.0)
Pandas(Index=4, userId=1, movieId=1172, rating=4.0)
Pandas(Index=5, userId=1, movieId=1263, rating=2.0)
Pandas(Index=6, userId=1, movieId=1287, rating=2.0)
Pandas(Index=7, userId=1, movieId=1293, rating=2.0)
Pandas(Index=8, userId=1, movieId=1339, rating=3.5)
Pandas(Index=9, userId=1, movieId=1343, rating=2.0)
Pandas(Index=10, userId=1, movieId=1371, rating=2.5)
Pandas(Index=11, userId=1, movieId=1405, rating=1.0)
Pandas(Index=12, userId=1, movieId=1953, rating=4.0)
Pandas(Index=13, userId=1, movieId=2105, rating=4.0)
Pandas(Index=14, userId=1, movieId=2150, rating=3.0)
Pandas(Index=15, userId=1, movieId=2193, rating=2.0)
Pandas(Index=16, userId=1, movieId=2294, rating=2.0)
Pandas(Index=17, userId=1, movieId=2455, rating=2.5)
Pandas(Index=18, userId=1, movieId=2968, rating=1.0)
Pandas(Index=19, userId=1, movieId=3671, rating=3.0)
Pandas(Index=20, userId=2, movieId=10, rating=4.0)
# Rows are userId - 1, columns are movieId - 1; 0 means "not rated"
df_matrix = np.zeros((num_users, max_movie_id))
for line in df_rating.itertuples():
    df_matrix[line[1]-1, line[2]-1] = line[3]
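# pairwise_distances returns cosine *distance* (1 - cosine similarity):
# 0 means identical rating vectors, 1 means nothing in common
user_similarity = pairwise_distances(df_matrix, metric='cosine')
movie_similarity = pairwise_distances(df_matrix.T, metric='cosine')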
user_similarity[:5]
array([[0. , 1. , 1. , ..., 0.91832729, 1. ,
1. ],
[1. , 0. , 0.8665161 , ..., 0.82637126, 0.90441205,
1. ],
[1. , 0.8665161 , 0. , ..., 0.87694711, 0.92983349,
0.91836046],
[0.92551755, 0.88117897, 0.91232459, ..., 0.69537101, 0.9291931 ,
1. ],
[0.98197165, 0.88889461, 0.82555476, ..., 0.88217131, 0.96200923,
1. ]])
movie_similarity[:5]
array([[0. , 0.59828933, 0.7080403 , ..., 0.87482476, 0.9309266 ,
0.9309266 ],
[0.59828933, 0. , 0.80294811, ..., 0.95590021, 1. ,
1. ],
[0.7080403 , 0.80294811, 0. , ..., 1. , 1. ,
1. ],
[0.86907562, 0.81043032, 0.80491616, ..., 1. , 1. ,
1. ],
[0.74970641, 0.74490917, 0.64807423, ..., 0.86215254, 0.85079341,
0.85079341]])
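scikit-learn's `pairwise_distances` with `metric='cosine'` returns cosine distance, which is 1 minus cosine similarity, so small values mean "close" and the zeros on the diagonal are each row compared with itself. The sketch below computes the same kind of quantity with plain NumPy on a made-up 3x3 matrix and converts it to a similarity. The function `cosine_distance_matrix` and the toy data are assumptions for illustration, not the notebook's code.

```python
import numpy as np

def cosine_distance_matrix(m):
    """Pairwise cosine distance between the rows of m (1 - cosine similarity)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = m / norms
    return 1.0 - unit.dot(unit.T)

# Toy rating matrix: rows 0 and 1 overlap heavily, row 2 shares nothing
toy = np.array([[5.0, 4.0, 0.0],
                [4.0, 5.0, 0.0],
                [0.0, 0.0, 3.0]])

distance = cosine_distance_matrix(toy)
similarity = 1.0 - distance  # larger now means more alike

# Nearest neighbour of row 0: smallest distance, skipping row 0 itself
order = np.argsort(distance[0])
nearest = int([i for i in order if i != 0][0])
print(nearest)
```

With a distance matrix, the closest users or movies come from sorting in ascending order; sorting in descending order surfaces the ones with the least in common.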
# Ten users with the largest cosine distance from user id 200 (row 199); sort ascending for the closest users
print "users similar to user id 200: \n", pd.DataFrame(user_similarity).loc[199,pd.DataFrame(user_similarity).loc[199,:] > 0].sort_values(ascending=False)[0:10]
users similar to user id 200:
565 1.0
548 1.0
34 1.0
70 1.0
75 1.0
134 1.0
157 1.0
275 1.0
309 1.0
324 1.0
Name: 199, dtype: float64
# Five movies with the largest cosine distance from matrix row 9 (movieId 10); sort ascending for the closest movies
print "movie similar to movie id 9: \n", pd.DataFrame(movie_similarity).loc[9,pd.DataFrame(movie_similarity).loc[9,:] > 0].sort_values(ascending=False)[0:5]
movie similar to movie id 9:
4500 1.0
5085 1.0
5072 1.0
5075 1.0
5076 1.0
Name: 9, dtype: float64
mv = pd.DataFrame(movie_similarity).loc[9,pd.DataFrame(movie_similarity).loc[9,:] > 0].sort_values(ascending=False)[0:5]
print "\t\t\t movie | %s | is similar to " % id_2_movie[31]
for i in mv.index:
    print id_2_movie[i]
movie | Dangerous Minds (1995) | is similar to
Man Who Fell to Earth, The (1976)
Miracle (2004)
Boy and His Dog, A (1975)
Tormented (1960)
Chitty Chitty Bang Bang (1968)
Building the Recommendation engine
# Function for item based rating prediction
def movie_based_prediction(rating_matrix, similarity_matrix):
    return rating_matrix.dot(similarity_matrix) / np.array([np.abs(similarity_matrix).sum(axis=1)])
# Function for user based rating prediction
def user_based_prediction(rating_matrix, similarity_matrix):
    mean_user_rating = rating_matrix.mean(axis=1)
    ratings_diff = (rating_matrix - mean_user_rating[:, np.newaxis])
    return mean_user_rating[:, np.newaxis] + similarity_matrix.dot(ratings_diff) / np.array([np.abs(similarity_matrix).sum(axis=1)]).T
movie_based_prediction = movie_based_prediction(df_matrix, movie_similarity)
user_based_prediction = user_based_prediction(df_matrix, user_similarity)
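To make the user-based formula easier to follow, here is a small worked sketch on a made-up 2x2 rating matrix. It mirrors the arithmetic of the notebook's function (user mean plus a weighted average of mean-centred ratings) under the hypothetical name `predict_user_based`; the weight matrices are toy values chosen so the result can be checked by hand. As in the notebook, the user mean is taken over every column, including unrated zeros.

```python
import numpy as np

def predict_user_based(ratings, weights):
    """User mean + weighted average of other users' mean-centred ratings."""
    mean_user = ratings.mean(axis=1, keepdims=True)
    centred = ratings - mean_user
    weight_totals = np.abs(weights).sum(axis=1, keepdims=True)
    return mean_user + weights.dot(centred) / weight_totals

ratings = np.array([[4.0, 2.0],
                    [2.0, 4.0]])

# Case 1: each user only listens to themself, so predictions equal the ratings
self_only = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
pred_self = predict_user_based(ratings, self_only)

# Case 2: both users weigh equally; their opposite deviations cancel out,
# leaving every prediction at the user mean of 3.0
equal = np.array([[0.5, 0.5],
                  [0.5, 0.5]])
pred_equal = predict_user_based(ratings, equal)

print(pred_self)
print(pred_equal)
```

The weights decide whose deviations from their own average count, which is why the choice between a distance and a similarity matrix matters for the predictions.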
Recommendation using user-based collaborative filtering
user_based = pd.DataFrame(user_based_prediction)
predictions = user_based.loc[5,pd.DataFrame(df_matrix).loc[5,:] == 0]
top = predictions.sort_values(ascending=False).head(n=5)
recommendations = pd.DataFrame(data=top)
recommendations.columns = ['Predicted Rating']
print "Prediction for Movies That User < #5 > has not rated yet! \n",recommendations
Prediction for Movies That User < #5 > has not rated yet!
Predicted Rating
317 2.013651
355 2.007654
295 1.945572
592 1.783942
259 1.771092
rec = [id_2_movie[i] for i in recommendations.index]
rec = pd.DataFrame(rec,columns=['Recommendation'])
print "Top Five Movies Recommendation for user < #5 > \n",rec
Top Five Movies Recommendation for user < #5 >
Recommendation
0 American Rhapsody, An (2001)
1 Red Heat (1988)
2 Ghosts of Mars (2001)
3 Lucas (1986)
4 Perfect Blue (1997)
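The selection step used above (keep only unrated movies, then take the highest predictions) can be wrapped in a small helper. This is a sketch with made-up numbers; `top_n_unrated` is a hypothetical name, and it assumes the notebook's convention that 0 marks an unrated movie and that column `j` holds movieId `j + 1`.

```python
import numpy as np

def top_n_unrated(predicted_row, rated_row, n=5):
    """Column indices of the n highest predictions among unrated items."""
    candidates = np.where(rated_row == 0)[0]
    # argsort is ascending, so reverse it to rank the best predictions first
    ranked = candidates[np.argsort(predicted_row[candidates])[::-1]]
    return ranked[:n]

# One user's predicted ratings and actual ratings (0 = not rated yet)
predicted = np.array([4.5, 3.0, 4.9, 1.0, 3.7])
rated = np.array([5.0, 0.0, 0.0, 0.0, 0.0])

top_columns = top_n_unrated(predicted, rated, n=2)
top_movie_ids = [int(c) + 1 for c in top_columns]  # column j -> movieId j + 1
print(top_movie_ids)
```

Converting the column indices back to movieIds before looking up titles keeps the recommendation list aligned with `movies.csv`.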
Recommendation using item-based collaborative filtering
movie_based = pd.DataFrame(movie_based_prediction)
predictions = movie_based.loc[5,pd.DataFrame(df_matrix).loc[5,:] == 0]
top = predictions.sort_values(ascending=False).head(n=5)
recommendations = pd.DataFrame(data=top)
recommendations.columns = ['Predicted Rating']
print "Prediction for Movies That User < #5 > has not rated yet! \n",recommendations
Prediction for Movies That User < #5 > has not rated yet!
Predicted Rating
3715 0.016042
3571 0.016033
3713 0.016032
2780 0.016027
2255 0.016020
rec = [id_2_movie[i] for i in recommendations.index]
rec = pd.DataFrame(rec,columns=['Recommendation'])
print "Top Five Movies Recommendation for user < #5 > \n",rec
Top Five Movies Recommendation for user < #5 >
Recommendation
0 American Rhapsody, An (2001)
1 Red Heat (1988)
2 Ghosts of Mars (2001)
3 Lucas (1986)
4 Perfect Blue (1997)