Movie Recommender System

Friday, April 27, 2018 - 15 mins

Recommender System/Engine

A recommender system, or recommendation system, is a subclass of information filtering systems that seeks to predict the “rating” or “preference” a user would give to an item. Recommender systems use these predictions to suggest relevant items and so create a personalized experience for each user.

Types/Approaches

There are two main types of recommendation systems:

content-based filtering
collaborative filtering

collaborative filtering

Collaborative filtering methods collect and analyze a large amount of information on users’ behaviors, activities or preferences, and predict what a user will like based on their similarity to other users. A key advantage of the collaborative filtering approach is that it does not rely on machine-analyzable content, so it can accurately recommend complex items such as movies without requiring an “understanding” of the item itself.
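As a rough illustration of the idea (not the code used later in this post), the sketch below finds the user whose ratings look most like a target user's and suggests a movie that neighbour rated highly. The `ratings` dictionary and the helper names `cosine_similarity` and `recommend_from_neighbour` are made up for the example.

```python
import math

# Hypothetical toy ratings: user -> {movie title: rating}
ratings = {
    "alice": {"Toy Story": 5.0, "Jumanji": 4.0, "Heat": 1.0},
    "bob":   {"Toy Story": 4.5, "Jumanji": 4.0, "Balto": 5.0},
    "carol": {"Heat": 5.0, "Casino": 4.5, "Toy Story": 1.0},
}

def cosine_similarity(a, b):
    """Cosine similarity between two rating dicts; missing ratings count as 0."""
    shared = set(a) & set(b)
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_from_neighbour(user):
    """Return (nearest neighbour, their best-rated movie the user has not rated)."""
    others = [u for u in ratings if u != user]
    neighbour = max(others, key=lambda u: cosine_similarity(ratings[user], ratings[u]))
    unseen = {m: r for m, r in ratings[neighbour].items() if m not in ratings[user]}
    best = max(unseen, key=unseen.get) if unseen else None
    return neighbour, best

neighbour, suggestion = recommend_from_neighbour("alice")
print("%s -> %s" % (neighbour, suggestion))
```

Nothing in this sketch looks at what the movies are about; only the overlap in ratings drives the suggestion.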

content-based filtering

Content-based filtering methods are based on a description of the item and a profile of the user’s preferences. In a content-based recommender system, keywords are used to describe the items and a user profile is built to indicate the type of item this user likes. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended.
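A minimal sketch of this comparison, assuming each movie is described only by its genre keywords and that similarity is measured with Jaccard overlap (one possible choice among many). The `genres` dictionary and the function names are hypothetical.

```python
# Hypothetical item descriptions: movie title -> set of genre keywords
genres = {
    "Toy Story": {"Adventure", "Animation", "Children", "Comedy", "Fantasy"},
    "Jumanji": {"Adventure", "Children", "Fantasy"},
    "Heat": {"Action", "Crime", "Thriller"},
    "GoldenEye": {"Action", "Adventure", "Thriller"},
    "Balto": {"Adventure", "Animation", "Children"},
}

def jaccard(a, b):
    """Share of keywords two items have in common (0 = none, 1 = identical)."""
    return len(a & b) / float(len(a | b))

def content_based_recommend(liked, top_n=2):
    """Rank movies the user has not liked yet by their best overlap with a liked movie."""
    scores = {}
    for candidate, keywords in genres.items():
        if candidate in liked:
            continue
        scores[candidate] = max(jaccard(keywords, genres[m]) for m in liked)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

result = content_based_recommend(["Toy Story"])
print(result)
```

Unlike collaborative filtering, no other user's ratings are involved here: the item descriptions alone decide what gets recommended.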

Data source:

https://grouplens.org/datasets/movielens/

The Notebook

The notebook for this work can be found at Recommender System IPython notebook.

import libraries

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances

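# Load the ratings and keep only the columns we need
df_rating = pd.read_json('movies_rating.json')
df_rating = df_rating[['userId','movieId','rating']]

# Load the movie metadata (movieId, title, genres)
df_mv = pd.read_csv('movies.csv')

print(df_rating.shape)

print(df_mv.shape)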

(84730, 3)
(9125, 3)

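# Conform the frame to the labels 0..N-1; labels absent from the original index become NaN rows
new_index = range(len(df_rating))

df_rating = df_rating.reindex(new_index)

# Drop the rows that contain NaN
df_rating.dropna(inplace=True)

print(df_rating.shape)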

(72129, 3)
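To see why the row count can shrink at this step, the toy example below shows what `reindex` followed by `dropna` does to a frame whose index has gaps. It assumes the index read from the JSON file is not a contiguous 0..N-1 range, which the post does not state; the `toy` frame and its values are made up.

```python
import pandas as pd

# Hypothetical frame whose index has gaps (labels 0, 2 and 5)
toy = pd.DataFrame({"userId": [1, 1, 2], "rating": [2.5, 3.0, 4.0]}, index=[0, 2, 5])

# reindex keeps rows whose label is in the new index and inserts a NaN row
# for every label that was missing (here label 1); label 5 is dropped
reindexed = toy.reindex(range(len(toy)))

# dropna then removes the NaN row that reindex introduced
cleaned = reindexed.dropna()

print(reindexed)
print(cleaned.shape)
```

If the goal is only to renumber the rows without losing any, `reset_index(drop=True)` is the usual pandas call for that.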

convert userID and movieID to integer

df_rating.userId = df_rating.userId.astype(int)
df_rating.movieId = df_rating.movieId.astype(int)

print(df_rating.head(10))

   userId  movieId  rating
     1       31     2.5
     1     1029     3.0
     1     1061     3.0
     1     1129     2.0
     1     1172     4.0
     1     1263     2.0
     1     1287     2.0
     1     1293     2.0
     1     1339     3.5
     1     1343     2.0

check number of users and number of movies

num_users = df_rating.userId.unique().shape[0]
num_movies = df_rating.movieId.unique().shape[0]

print('\nNumber of users = ' + str(num_users) + ' | Number of movies = ' + str(num_movies))

Number of users = 566 | Number of movies = 5411

max_movie_id = df_rating.movieId.max()
min_movie_id = df_rating.movieId.min()

max_user_id = df_rating.userId.max()
min_user_id  = df_rating.userId.min()

print("Max MovieID : %d | Min MovieID : %d " % (max_movie_id, min_movie_id))
print("Max UserID : %d | Min UserID : %d " % (max_user_id, min_user_id))

Max MovieID : 9000 | Min MovieID : 1 
Max UserID : 566 | Min UserID : 1 

check the movie titles

df_mv

	movieId	title	genres
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
1	2	Jumanji (1995)	Adventure|Children|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama|Romance
4	5	Father of the Bride Part II (1995)	Comedy
5	6	Heat (1995)	Action|Crime|Thriller
6	7	Sabrina (1995)	Comedy|Romance
7	8	Tom and Huck (1995)	Adventure|Children
8	9	Sudden Death (1995)	Action
9	10	GoldenEye (1995)	Action|Adventure|Thriller
10	11	American President, The (1995)	Comedy|Drama|Romance
11	12	Dracula: Dead and Loving It (1995)	Comedy|Horror
12	13	Balto (1995)	Adventure|Animation|Children
13	14	Nixon (1995)	Drama
14	15	Cutthroat Island (1995)	Action|Adventure|Romance
15	16	Casino (1995)	Crime|Drama
16	17	Sense and Sensibility (1995)	Drama|Romance
17	18	Four Rooms (1995)	Comedy
18	19	Ace Ventura: When Nature Calls (1995)	Comedy
19	20	Money Train (1995)	Action|Comedy|Crime|Drama|Thriller
20	21	Get Shorty (1995)	Comedy|Crime|Thriller
21	22	Copycat (1995)	Crime|Drama|Horror|Mystery|Thriller
22	23	Assassins (1995)	Action|Crime|Thriller
23	24	Powder (1995)	Drama|Sci-Fi
24	25	Leaving Las Vegas (1995)	Drama|Romance
25	26	Othello (1995)	Drama
26	27	Now and Then (1995)	Children|Drama
27	28	Persuasion (1995)	Drama|Romance
28	29	City of Lost Children, The (Cité des enfants p...	Adventure|Drama|Fantasy|Mystery|Sci-Fi
29	30	Shanghai Triad (Yao a yao yao dao waipo qiao) ...	Crime|Drama
...	...	...	...
9095	159690	Teenage Mutant Ninja Turtles: Out of the Shado...	Action|Adventure|Comedy
9096	159755	Popstar: Never Stop Never Stopping (2016)	Comedy
9097	159858	The Conjuring 2 (2016)	Horror
9098	159972	Approaching the Unknown (2016)	Drama|Sci-Fi|Thriller
9099	160080	Ghostbusters (2016)	Action|Comedy|Horror|Sci-Fi
9100	160271	Central Intelligence (2016)	Action|Comedy
9101	160438	Jason Bourne (2016)	Action
9102	160440	The Maid's Room (2014)	Thriller
9103	160563	The Legend of Tarzan (2016)	Action|Adventure
9104	160565	The Purge: Election Year (2016)	Action|Horror|Sci-Fi
9105	160567	Mike & Dave Need Wedding Dates (2016)	Comedy
9106	160590	Survive and Advance (2013)	(no genres listed)
9107	160656	Tallulah (2016)	Drama
9108	160718	Piper (2016)	Animation
9109	160954	Nerve (2016)	Drama|Thriller
9110	161084	My Friend Rockefeller (2015)	Documentary
9111	161155	Sunspring (2016)	Sci-Fi
9112	161336	Author: The JT LeRoy Story (2016)	Documentary
9113	161582	Hell or High Water (2016)	Crime|Drama
9114	161594	Kingsglaive: Final Fantasy XV (2016)	Action|Adventure|Animation|Drama|Fantasy|Sci-Fi
9115	161830	Body (2015)	Drama|Horror|Thriller
9116	161918	Sharknado 4: The 4th Awakens (2016)	Action|Adventure|Horror|Sci-Fi
9117	161944	The Last Brickmaker in America (2001)	Drama
9118	162376	Stranger Things	Drama
9119	162542	Rustom (2016)	Romance|Thriller
9120	162672	Mohenjo Daro (2016)	Adventure|Drama|Romance
9121	163056	Shin Godzilla (2016)	Action|Adventure|Fantasy|Sci-Fi
9122	163949	The Beatles: Eight Days a Week - The Touring Y...	Documentary
9123	164977	The Gay Desperado (1936)	Comedy
9124	164979	Women of '69, Unboxed	Documentary

9125 rows × 3 columns

check the number of unique movie titles

len(df_mv.title.unique())

len(df_mv.movieId)

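# Map each movieId to its title. movieId values are not contiguous,
# so pair them with titles directly instead of looking up by row position.
id_2_movie = {}

for movie_id, title in zip(df_mv.movieId, df_mv.title):
  if movie_id > max_movie_id:
    break
  id_2_movie[movie_id] = title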

Create user-movies similarity matrices

# Peek at the first 21 rows as namedtuples: (Index, userId, movieId, rating)
cnt = 0
for line in df_rating.itertuples():
    print(line)
    cnt +=1
    if cnt > 20:
        break

Pandas(Index=0, userId=1, movieId=31, rating=2.5)
Pandas(Index=1, userId=1, movieId=1029, rating=3.0)
Pandas(Index=2, userId=1, movieId=1061, rating=3.0)
Pandas(Index=3, userId=1, movieId=1129, rating=2.0)
Pandas(Index=4, userId=1, movieId=1172, rating=4.0)
Pandas(Index=5, userId=1, movieId=1263, rating=2.0)
Pandas(Index=6, userId=1, movieId=1287, rating=2.0)
Pandas(Index=7, userId=1, movieId=1293, rating=2.0)
Pandas(Index=8, userId=1, movieId=1339, rating=3.5)
Pandas(Index=9, userId=1, movieId=1343, rating=2.0)
Pandas(Index=10, userId=1, movieId=1371, rating=2.5)
Pandas(Index=11, userId=1, movieId=1405, rating=1.0)
Pandas(Index=12, userId=1, movieId=1953, rating=4.0)
Pandas(Index=13, userId=1, movieId=2105, rating=4.0)
Pandas(Index=14, userId=1, movieId=2150, rating=3.0)
Pandas(Index=15, userId=1, movieId=2193, rating=2.0)
Pandas(Index=16, userId=1, movieId=2294, rating=2.0)
Pandas(Index=17, userId=1, movieId=2455, rating=2.5)
Pandas(Index=18, userId=1, movieId=2968, rating=1.0)
Pandas(Index=19, userId=1, movieId=3671, rating=3.0)
Pandas(Index=20, userId=2, movieId=10, rating=4.0)

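# Build the user x movie rating matrix: rows are users, columns are movie IDs
# (both shifted to 0-based); cells with no rating stay 0
df_matrix = np.zeros((num_users, max_movie_id))
for line in df_rating.itertuples():
    df_matrix[line[1]-1, line[2]-1] = line[3]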
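The same construction on a handful of made-up ratings may make the index arithmetic easier to follow. The `triples` list below is hypothetical; the loop mirrors the one above.

```python
import numpy as np

# Hypothetical (userId, movieId, rating) triples with 1-based IDs
triples = [(1, 1, 4.0), (1, 3, 2.5), (2, 2, 5.0), (3, 3, 3.0)]

num_users = max(u for u, _, _ in triples)
max_movie_id = max(m for _, m, _ in triples)

# One row per user, one column per possible movie ID; 0 marks "not rated"
matrix = np.zeros((num_users, max_movie_id))
for user_id, movie_id, rating in triples:
    matrix[user_id - 1, movie_id - 1] = rating  # shift 1-based IDs to 0-based

print(matrix)
```

Note that row `i` holds user `i + 1` and column `j` holds movie `j + 1`, so any index read back from the matrix needs 1 added before it is used as an ID.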

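# pairwise_distances with metric='cosine' returns cosine distance
# (1 - cosine similarity): 0 means identical rating vectors
user_similarity = pairwise_distances(df_matrix, metric='cosine')

movie_similarity = pairwise_distances(df_matrix.T, metric='cosine')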
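`pairwise_distances(..., metric='cosine')` returns cosine distance, defined as 1 minus cosine similarity, so small values mean "alike" and values near 1 mean "nothing in common". The sketch below reproduces that definition with plain NumPy on a toy matrix; `cosine_distance_matrix` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
import numpy as np

def cosine_distance_matrix(m):
    """Pairwise cosine distance (1 - cosine similarity) between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = m / norms
    return 1.0 - unit.dot(unit.T)

toy = np.array([
    [5.0, 4.0, 0.0],   # user 1
    [5.0, 4.0, 0.0],   # user 2: same ratings as user 1
    [0.0, 0.0, 3.0],   # user 3: no movies in common with users 1 and 2
])

distance = cosine_distance_matrix(toy)
similarity = 1.0 - distance   # convert back to a similarity if that is what is needed

print(np.round(distance, 3))
```

Users 1 and 2 end up at distance 0 (similarity 1), while user 3 sits at distance 1 from both.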

user_similarity[:5]

array([[0.        , 1.        , 1.        , ..., 0.91832729, 1.        ,
        1.        ],
       [1.        , 0.        , 0.8665161 , ..., 0.82637126, 0.90441205,
        1.        ],
       [1.        , 0.8665161 , 0.        , ..., 0.87694711, 0.92983349,
        0.91836046],
       [0.92551755, 0.88117897, 0.91232459, ..., 0.69537101, 0.9291931 ,
        1.        ],
       [0.98197165, 0.88889461, 0.82555476, ..., 0.88217131, 0.96200923,
        1.        ]])

movie_similarity[:5]

array([[0.        , 0.59828933, 0.7080403 , ..., 0.87482476, 0.9309266 ,
        0.9309266 ],
       [0.59828933, 0.        , 0.80294811, ..., 0.95590021, 1.        ,
        1.        ],
       [0.7080403 , 0.80294811, 0.        , ..., 1.        , 1.        ,
        1.        ],
       [0.86907562, 0.81043032, 0.80491616, ..., 1.        , 1.        ,
        1.        ],
       [0.74970641, 0.74490917, 0.64807423, ..., 0.86215254, 0.85079341,
        0.85079341]])

# Top 10 similar users for user id 200
user_row = pd.DataFrame(user_similarity).loc[199]
print("users similar to user id 200: \n{}".format(user_row[user_row > 0].sort_values(ascending=False)[0:10]))

users similar to user id 200: 
  1.0
  1.0
   1.0
   1.0
   1.0
  1.0
  1.0
  1.0
  1.0
  1.0
Name: 199, dtype: float64

# Top 5 similar movies for movie id 9
movie_row = pd.DataFrame(movie_similarity).loc[9]
print("movie similar to movie id 9: \n{}".format(movie_row[movie_row > 0].sort_values(ascending=False)[0:5]))

movie similar to movie id 9: 
  1.0
  1.0
  1.0
  1.0
  1.0
Name: 9, dtype: float64
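Because `pairwise_distances` returns distances, the closest users or movies are the entries with the smallest values; sorting in descending order surfaces the most distant ones instead, which is consistent with the values of 1.0 printed above. A minimal sketch of a nearest-neighbour lookup on a distance matrix, with a hypothetical helper `nearest_rows` and a made-up matrix:

```python
import numpy as np

def nearest_rows(distance, row, k):
    """Indices of the k rows with the smallest distance to `row`, excluding itself."""
    order = np.argsort(distance[row])   # ascending: closest first
    order = order[order != row]         # drop the row's own index
    return order[:k]

# Hypothetical 4x4 cosine-distance matrix (0 on the diagonal)
distance = np.array([
    [0.0, 0.2, 0.9, 0.5],
    [0.2, 0.0, 0.7, 0.4],
    [0.9, 0.7, 0.0, 0.1],
    [0.5, 0.4, 0.1, 0.0],
])

closest = nearest_rows(distance, row=0, k=2)
print(closest)
```

The returned values are 0-based row indices, so 1 has to be added to turn them back into user or movie IDs.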

mv = pd.DataFrame(movie_similarity).loc[9,pd.DataFrame(movie_similarity).loc[9,:] > 0].sort_values(ascending=False)[0:5]

print("\t\t\t movie |  %s  | is similar to " % id_2_movie[31])
for i in mv.index:
  print(id_2_movie[i])

       movie |  Dangerous Minds (1995)  | is similar to 
Man Who Fell to Earth, The (1976)
Miracle (2004)
Boy and His Dog, A (1975)
Tormented (1960)
Chitty Chitty Bang Bang (1968)

Building the Recommendation engine

# Function for item based rating prediction
def movie_based_prediction(rating_matrix, similarity_matrix):
    return rating_matrix.dot(similarity_matrix) / np.array([np.abs(similarity_matrix).sum(axis=1)])

# Function for user based rating prediction
def user_based_prediction(rating_matrix, similarity_matrix):
    mean_user_rating = rating_matrix.mean(axis=1)
    ratings_diff = (rating_matrix - mean_user_rating[:, np.newaxis])
    return mean_user_rating[:, np.newaxis] + similarity_matrix.dot(ratings_diff) / np.array([np.abs(similarity_matrix).sum(axis=1)]).T
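To make the user-based formula concrete, here is the same computation on a 2x2 toy example: each prediction is the user's mean rating plus a weighted average of how far the other users' ratings sit from their own means. The function name `user_based_sketch` and the input values are made up. Note that, as in the functions above, the mean is taken over the whole row, so unrated movies stored as 0 would pull it down on real data.

```python
import numpy as np

def user_based_sketch(ratings, weights):
    """Mean-centred, weight-averaged prediction for every (user, movie) cell."""
    mean = ratings.mean(axis=1, keepdims=True)          # each user's average rating
    diff = ratings - mean                               # deviation from that average
    norm = np.abs(weights).sum(axis=1, keepdims=True)   # sum of |weights| per user
    return mean + weights.dot(diff) / norm

ratings = np.array([
    [4.0, 2.0],
    [2.0, 4.0],
])
# Hypothetical weights: each user only "listens" to the other one
weights = np.array([
    [0.0, 1.0],
    [1.0, 0.0],
])

pred = user_based_sketch(ratings, weights)
print(pred)
```

With these weights each user simply inherits the other user's deviations around their own mean of 3, so the two rows swap.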

movie_based_prediction = movie_based_prediction(df_matrix, movie_similarity)
user_based_prediction = user_based_prediction(df_matrix, user_similarity)

Recommendation using user-based collaborative filtering

user_based = pd.DataFrame(user_based_prediction)
# Keep only the movies this user has not rated (0 in the rating matrix)
predictions = user_based.loc[5,pd.DataFrame(df_matrix).loc[5,:] == 0]
# Five highest predicted ratings
top = predictions.sort_values(ascending=False).head(n=5)
recommendations = pd.DataFrame(data=top)
recommendations.columns = ['Predicted Rating']
print("Prediction for Movies That User < #5 > has not rated yet! \n{}".format(recommendations))

Prediction for Movies That User < #5 > has not rated yet! 
     Predicted Rating
        2.013651
        2.007654
        1.945572
        1.783942
        1.771092
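The selection step can be sketched with plain NumPy as follows: mask out what the user has already rated, rank the rest by predicted rating, and convert column positions back to movie IDs. The helper `top_n_unrated` and both arrays are hypothetical.

```python
import numpy as np

def top_n_unrated(predicted_row, rated_row, n):
    """Column indices of the n highest predictions among movies the user has not rated."""
    candidates = np.where(rated_row == 0)[0]   # 0 marks "not rated"
    ranked = candidates[np.argsort(-predicted_row[candidates])]
    return ranked[:n]

rated = np.array([4.0, 0.0, 0.0, 3.0, 0.0])       # hypothetical ratings for one user
predicted = np.array([3.9, 2.1, 3.4, 3.0, 1.2])   # hypothetical predictions for that user

cols = top_n_unrated(predicted, rated, n=2)
movie_ids = cols + 1   # column j holds movieId j + 1 in the rating matrix
print(movie_ids)
```

The `+ 1` matters when the result is used to look up titles, since the rating matrix was filled at `movieId - 1`.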

rec = [id_2_movie[i] for i in recommendations.index]

rec = pd.DataFrame(rec,columns=['Recommendation'])

print("Top Five Movies Recommendation for user < #5 > \n{}".format(rec))

Top Five Movies Recommendation for user < #5 > 
                  Recommendation
American Rhapsody, An (2001)
             Red Heat (1988)
       Ghosts of Mars (2001)
                Lucas (1986)
         Perfect Blue (1997)

Recommendation using movie-based (item-based) collaborative filtering

movie_based = pd.DataFrame(movie_based_prediction)
# Keep only the movies this user has not rated (0 in the rating matrix)
predictions = movie_based.loc[5,pd.DataFrame(df_matrix).loc[5,:] == 0]
# Five highest predicted ratings
top = predictions.sort_values(ascending=False).head(n=5)
recommendations = pd.DataFrame(data=top)
recommendations.columns = ['Predicted Rating']
print("Prediction for Movies That User < #5 > has not rated yet! \n{}".format(recommendations))

Prediction for Movies That User < #5 > has not rated yet! 
      Predicted Rating
        0.016042
        0.016033
        0.016032
        0.016027
        0.016020

rec = [id_2_movie[i] for i in recommendations.index]

rec = pd.DataFrame(rec,columns=['Recommendation'])

print("Top Five Movies Recommendation for user < #5 > \n{}".format(rec))

Top Five Movies Recommendation for user < #5 > 
                  Recommendation
American Rhapsody, An (2001)
             Red Heat (1988)
       Ghosts of Mars (2001)
                Lucas (1986)
         Perfect Blue (1997)

Mustapha Omotosho

Constant learner, machine learning enthusiast, huge Barcelona fan