IMDB 5000 Movies
Background
How can we tell how good a movie is before it is released in cinemas?
This question puzzled me for a long time, since there is no universal way to judge the quality of a movie. Many people rely on critics to gauge the quality of a film, while others use their instincts. But it takes time to gather a reasonable number of critic reviews after a movie is released, and human instinct is sometimes unreliable.
Question
- Given that thousands of movies are produced each year, is there a better way to tell how good a movie is without relying on critics or our own instincts?
- Does the number of human faces on a movie poster correlate with the movie's rating?
Method
To answer these questions, I scraped 5000+ movies from the IMDB website using a Python library called "scrapy".
The scraping process took 2 hours to finish. In the end, I obtained all 28 variables for 5043 movies, along with 4906 posters (998MB), spanning 100 years and 66 countries. There are 2399 unique director names and thousands of actors and actresses. The 28 variables are:
- "movie_title"
- "color"
- "num_critic_for_reviews"
- "movie_facebook_likes"
- "duration"
- "director_name"
- "director_facebook_likes"
- "actor_3_name"
- "actor_3_facebook_likes"
- "actor_2_name"
- "actor_2_facebook_likes"
- "actor_1_name"
- "actor_1_facebook_likes"
- "gross"
- "genres"
- "num_voted_users"
- "cast_total_facebook_likes"
- "facenumber_in_poster"
- "plot_keywords"
- "movie_imdb_link"
- "num_user_for_reviews"
- "language"
- "country"
- "content_rating"
- "budget"
- "title_year"
- "imdb_score"
- "aspect_ratio"
To answer question 2, I applied a human face detection algorithm to all the posters using a Python library called dlib, and extracted the number of faces in each poster.
Exploratory Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
%matplotlib inline
- How can people estimate a movie's rating (a proxy for how good it is) before watching it, without relying on critics or their own instincts? That is the point of this analysis.
movies = pd.read_csv("data/movie_metadata.csv")
movies.head()
movies.columns
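movies['content_rating'].value_counts() # content_rating is the film's age classification
Distribution by Content Rating
# factorplot was renamed catplot (and size renamed height) in seaborn 0.9
sns.catplot(x='content_rating', kind='count', data=movies, height=8)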
plt.xticks(rotation=45)
Maximum and Minimum Ratings
movies['imdb_score'].max() # 10 is the highest possible; the maximum rating in this dataset is 9.5
movies['imdb_score'].min() # 0 is the lowest possible; the minimum rating in this dataset is 1.6
sns.catplot(y='imdb_score', kind='box', data=movies)
First, round the ratings to integers so that the groups are easier to tell apart when examining effects
df = movies.copy()
df['imdb_score'] = df['imdb_score'].apply(lambda x:int(round(x)))
df['imdb_score'].value_counts()
sns.catplot(x='imdb_score', kind='count', data=df, height=6)
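One caveat about the rounding step: Python's built-in round sends exact halves to the nearest even integer, so a score of 6.5 becomes 6 while 7.5 becomes 8. The following is a minimal sketch on made-up scores (not the dataset) that compares this with a half-up alternative. The helper round_half_up is hypothetical and not part of the original notebook.

```python
import math

import pandas as pd

# Made-up scores for illustration; not taken from the dataset.
scores = pd.Series([6.4, 6.5, 7.5, 8.6])

# Built-in round() sends exact halves to the nearest even integer.
banker = scores.apply(lambda x: int(round(x)))


def round_half_up(x):
    """Hypothetical helper: round exact halves upward."""
    return int(math.floor(x + 0.5))


half_up = scores.apply(round_half_up)

print(banker.tolist())
print(half_up.tolist())
```

Which rule is preferable depends on the analysis; the point is only that the two can place a x.5 score in different bins.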
IMDB Score vs Language
tmp = df['language'].value_counts()
language_list = tmp[tmp>3].index.tolist()
sns.boxplot(y='imdb_score',x='language',data=df[df['language'].isin(language_list)])
plt.xticks(rotation=45)
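The filtering step above keeps only languages that appear often enough to give a meaningful box. A minimal sketch of the same pattern on toy data follows; the values and the min_count threshold are assumptions chosen for illustration (the notebook uses 3).

```python
import pandas as pd

# Toy data for illustration; the values are made up.
toy = pd.DataFrame({
    "language": ["English", "English", "English", "French", "French", "Korean"],
    "imdb_score": [7, 6, 8, 7, 5, 8],
})

min_count = 1  # hypothetical threshold; the notebook uses 3

# Count rows per language, then keep languages above the threshold.
counts = toy["language"].value_counts()
frequent = counts[counts > min_count].index.tolist()

# Keep only rows whose language passed the threshold.
filtered = toy[toy["language"].isin(frequent)]

print(sorted(frequent))
print(len(filtered))
```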
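IMDB Score vs Movie Year
- In more recent years the spread of ratings widens; perhaps there is a tendency to release mainly entertainment-oriented films.
- The more recent the year, the lower the score tends to be.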
title_year_df = df[['title_year','imdb_score']]
title_year_df = title_year_df.dropna()
title_year_df['title_year'] = title_year_df['title_year'].astype(int)
sns.catplot(y='imdb_score', x='title_year', data=title_year_df, kind='box', height=10)
plt.xticks(rotation=45)
IMDB Score vs Facebook Popularity
- Movies with more Facebook likes tend to have a higher imdb_score, though the relation is weak
- Correlation: 0.24
sns.boxplot(x='imdb_score',y='movie_facebook_likes',data=df)
df[['movie_facebook_likes','imdb_score']].corr()
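The value returned by corr() is, by default, the Pearson coefficient: the covariance of the two columns divided by the product of their standard deviations. A minimal sketch on made-up numbers (not the dataset) that computes it by hand and checks it against pandas:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "movie_facebook_likes": [0, 1000, 5000, 20000, 80000],
    "imdb_score": [5, 6, 6, 7, 8],
})

x = toy["movie_facebook_likes"].to_numpy(dtype=float)
y = toy["imdb_score"].to_numpy(dtype=float)

# Pearson r = sum((x - mean_x) * (y - mean_y))
#             / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity with its default method.
r_pandas = toy.corr().loc["movie_facebook_likes", "imdb_score"]

print(r_manual, r_pandas)
```

The coefficient measures linear association only, so a modest value such as 0.24 says little about whether likes drive ratings or the reverse.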
Correlation Analysis with Continuous Variables
columns = df.columns
discrete = []    # object-typed (text/categorical) columns
continuous = []  # numeric columns
for i in columns:
    if df[i].dtype == 'object':
        discrete.append(i)
    else:
        continuous.append(i)
# sns.pairplot(df[continuous].dropna(),kind="reg")
df_corr = df[continuous].dropna().corr()
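The column split above can also be sketched with select_dtypes, which avoids the explicit loop. The toy frame below is made up for illustration, and this is one possible alternative rather than the notebook's own approach.

```python
import pandas as pd

# Toy frame for illustration; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "director_name": ["A", "B", "C"],
    "duration": [120.0, 95.0, 142.0],
    "imdb_score": [7, 6, 8],
})

# Numeric columns are treated as continuous, everything else as discrete.
continuous = toy.select_dtypes(include="number").columns.tolist()
discrete = toy.select_dtypes(exclude="number").columns.tolist()

print(continuous)
print(discrete)
```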
Correlation among Continuous Variables
The following variables had a positive effect on the score:
- Number of critic reviews
- Duration
- Gross
- Number of user reviews
- Number of voted users
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(df_corr,
            xticklabels=df_corr.columns.values,
            yticklabels=df_corr.columns.values,
            annot=True, linewidths=.5, ax=ax)
df['facenumber_in_poster'].head()
df2 = df.dropna().copy()
df2['facenumber_in_poster'] = df2['facenumber_in_poster'].astype(int)
sns.catplot(x='facenumber_in_poster', kind='count', data=df2, height=6)
relation_of_fn = df[['facenumber_in_poster','imdb_score']].dropna().corr()
relation_of_fn
Histogram of IMDB Scores (0~10)
sns.set(rc={"figure.figsize": (8, 6)});
sns.distplot(movies['imdb_score'])
Delete columns that are not needed
del df['movie_imdb_link']
del df['color']
Relation between director_name and imdb_score
- Finding: there is a slight relation between the two, but not a strong one
Number of directors
- 2399
len(df['director_name'].unique())
tmp_X = pd.get_dummies(df['director_name'])
tmp_X['imdb_score'] = df['imdb_score']
df_corr = tmp_X.corr()
df_corr['imdb_score'][:5]
corr = df_corr['imdb_score']
corr[corr>0.05]
director_name_list = corr[corr>0.05].index.tolist()
director_name_list
df[df['director_name']=='Steven Spielberg'][['imdb_score','movie_title']]
df['director_name'] = df['director_name'].astype(object)
sns.boxplot(y='imdb_score',x='director_name',data=df[df['director_name'].isin(director_name_list)])
plt.xticks(rotation=45)
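Correlating one-hot director columns with the score tends to favor directors with many films, since a single-film director's indicator column is almost all zeros. As a complementary view, the sketch below averages the score per director and keeps only directors with enough films. The names, scores, and the min_films threshold are all made up for illustration.

```python
import pandas as pd

# Toy data for illustration; directors and scores are made up.
toy = pd.DataFrame({
    "director_name": ["Director A", "Director A", "Director A",
                      "Director B", "Director B", "Director C"],
    "imdb_score": [8, 7, 9, 5, 6, 9],
})

min_films = 2  # hypothetical threshold for a "reliable" average

# Mean score and film count per director.
stats = toy.groupby("director_name")["imdb_score"].agg(["mean", "count"])

# Drop directors with too few films, then rank by mean score.
reliable = stats[stats["count"] >= min_films].sort_values("mean", ascending=False)

print(reliable)
```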
3. Relation with Genres
- Negative effect: Horror and Comedy are associated with lower movie ratings
- Positive effect: Biography, Documentary, Drama, History, and War are associated with higher movie ratings
genre_list = df['genres'].str.split('|')
genre_list[:5]
# Could the genres be collected into a set? (Not used later.)
genre = set()
for i in range(len(genre_list)):
    genre |= set(genre_list[i])
genre
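Worst way to turn lists of different lengths into a DataFrame
genre_df = pd.DataFrame()
for i in range(len(genre_list)):
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
    # Growing a DataFrame one row at a time is slow, hence "worst way".
    genre_df = pd.concat([genre_df, pd.DataFrame(genre_list[i]).T])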
genre_df.head()
- Way 1.
genre_df = pd.DataFrame(genre_list.values.tolist(), index=genre_list.index)
genre_df.head()
genre_df = pd.DataFrame(genre_list.values.tolist(), index=genre_list.index).replace({None:np.nan})
genre_df.head()
- Way 2
genre_df = genre_list.apply(pd.Series)
genre_df.head()
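A third option, sketched here on made-up genre strings, is to let str.split build the wide frame directly with expand=True, which skips the intermediate Series of lists. This is an alternative to the two ways above, not something the original notebook used.

```python
import pandas as pd

# Made-up genre strings in the same pipe-separated format as the dataset.
genres = pd.Series(["Action|Adventure|Sci-Fi", "Drama", "Comedy|Romance"])

# expand=True returns one column per split position; short rows are padded
# with missing values.
wide = genres.str.split("|", expand=True)

print(wide)
```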
Making dummy data
- Count variables, as in dummy coding
- Way 1
pd.get_dummies(genre_df.stack()).groupby(level=0).max().head()
- Way 2
genre_df.stack().groupby(level=0).value_counts().unstack(fill_value=0).head() # int
genre_df.stack().groupby(level=0).value_counts().unstack().fillna(0).head() # float
- Way 3
genre_df.apply(lambda row: row.value_counts(), axis=1).fillna(0).astype(int).head() # top-level pd.value_counts is deprecated
df_genre = genre_df.stack().groupby(level=0).value_counts().unstack(fill_value=0)
df_genre['imdb_score'] = df['imdb_score']
df_genre.head()
genre_corr = df_genre.corr()
genre_corr = genre_corr['imdb_score']
genre_corr[abs(genre_corr)>0.1]
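The whole genre pipeline (split, dummy coding, correlation with the score) can also be sketched in a few lines with Series.str.get_dummies, which splits on a separator and builds the indicator columns in one call. The toy rows below are made up, so the resulting numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy data for illustration; genres and scores are made up.
toy = pd.DataFrame({
    "genres": ["Drama|War", "Comedy|Horror", "Drama", "Horror"],
    "imdb_score": [8, 5, 7, 4],
})

# One 0/1 indicator column per genre, split on the pipe separator.
indicators = toy["genres"].str.get_dummies(sep="|")
indicators["imdb_score"] = toy["imdb_score"]

# Correlation of each genre indicator with the score.
genre_corr = indicators.corr()["imdb_score"].drop("imdb_score")

print(genre_corr)
```

With a 0/1 indicator, a positive coefficient means films in that genre score above the overall average in the sample, and a negative one means they score below it.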