Project: Investigation of TMDb Movie Dataset

Table of Contents

Introduction

Welcome to my investigation of TMDb movie data set which contains information about 10,000 movies. In terms of the data set, it was cleaned from original data on Kaggle, and is about to get cleaned second time to focus and answer the main questions.

After data analysis steps of data wrangling, and EDA, I hope to provide some useful insights about the data set through answering these following questions with visualizations:

  • Question 1: Which genres are most popular overtime ?
  • Question 2: What are time series of top 5 movie genres?
  • Question 3: What are factors associated with revenue levels?

Data Wrangling

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Import statements of all packages that I use

In [3]:
df=pd.read_csv("movies.csv")
df.head()
Out[3]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09

5 rows × 21 columns

Read the dataset
In [4]:
df.shape
Out[4]:
(10866, 21)

There are 10866 rows and 21 columns in total

In [5]:
df.describe()
Out[5]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10866.000000 10866.000000 1.086600e+04 1.086600e+04 10866.000000 10866.000000 10866.000000 10866.000000 1.086600e+04 1.086600e+04
mean 66064.177434 0.646441 1.462570e+07 3.982332e+07 102.070863 217.389748 5.974922 2001.322658 1.755104e+07 5.136436e+07
std 92130.136561 1.000185 3.091321e+07 1.170035e+08 31.381405 575.619058 0.935142 12.812941 3.430616e+07 1.446325e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.250000 0.207583 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20669.000000 0.383856 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75610.000000 0.713817 1.500000e+07 2.400000e+07 111.000000 145.750000 6.600000 2011.000000 2.085325e+07 3.369710e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09

Statistics summary I can extract to have an overall view of the dataset

In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

The infor method describes dtypes of the dataset, and shows that there are missing values that I need to work on.

In [7]:
df.drop(labels = ["id", "imdb_id", "budget", "revenue", "cast", "homepage", "tagline", "overview", "release_date"], axis = 1, inplace = True)
df.head(2)
Out[7]:
popularity original_title director keywords runtime genres production_companies vote_count vote_average release_year budget_adj revenue_adj
0 32.985763 Jurassic World Colin Trevorrow monster|dna|tyrannosaurus rex|velociraptor|island 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 5562 6.5 2015 1.379999e+08 1.392446e+09
1 28.419936 Mad Max: Fury Road George Miller future|chase|post-apocalyptic|dystopia|australia 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 6185 7.1 2015 1.379999e+08 3.481613e+08

Drop unnecessary columns that have no use for my dataset investigation. There are 2 budget and 2 revenue columns in which budget and revenue with "adj" are recalculated in terms of 2010 dollars, accounting inflation overtime. Therefore, I keep such last 2 columns with "adj" which are more updated and drop the other 2 without "adj".

In [8]:
df.dropna(inplace = True)

All the missing values belong to columns with dtype of string, so I decide to drop any rows with such null values.

In [9]:
df.shape
Out[9]:
(8692, 12)

The dataset is now downsized to 8692 rows and 12 columns

In [10]:
df.rename(columns={"original_title":"title", "budget_adj":"budget", "revenue_adj":"revenue"}, inplace = True)
df.head(1)
Out[10]:
popularity title director keywords runtime genres production_companies vote_count vote_average release_year budget revenue
0 32.985763 Jurassic World Colin Trevorrow monster|dna|tyrannosaurus rex|velociraptor|island 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 5562 6.5 2015 1.379999e+08 1.392446e+09

Rename 3 columns in which "budget_adj" and "revenue_adj" are changed to "budget" and "revenue" respectively for easy following because now there is only 1 column of budget and one of revenue in the dataset.

In [11]:
df.hist(figsize=(12,8));

Overview of some columns in the dataset, and as can be seen, most of models are right-skewed.

In [12]:
df.isnull().sum()
Out[12]:
popularity              0
title                   0
director                0
keywords                0
runtime                 0
genres                  0
production_companies    0
vote_count              0
vote_average            0
release_year            0
budget                  0
revenue                 0
dtype: int64

Double check if there are any missing values

In [13]:
df.head(3)
Out[13]:
popularity title director keywords runtime genres production_companies vote_count vote_average release_year budget revenue
0 32.985763 Jurassic World Colin Trevorrow monster|dna|tyrannosaurus rex|velociraptor|island 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 5562 6.5 2015 1.379999e+08 1.392446e+09
1 28.419936 Mad Max: Fury Road George Miller future|chase|post-apocalyptic|dystopia|australia 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 6185 7.1 2015 1.379999e+08 3.481613e+08
2 13.112507 Insurgent Robert Schwentke based on novel|revolution|dystopia|sequel|dyst... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 2480 6.3 2015 1.012000e+08 2.716190e+08

The dataset is now clean as seen above.

Exploratory Data Analysis

In [136]:
arr_genres = set([])
for genre in df["genres"].unique():
    temp = genre.split("|")
    arr_genres.update(temp)

Get all unique genres using for loop and reference from this topic (set.update())

In [137]:
for name in list(arr_genres):
    df[name] = 0

Create and add a new column for each unique genre to the dataset

In [138]:
for index, row in df.iterrows():
    genres = row["genres"].split("|")
    for genre in genres:
        df.loc[index, genre] = 1

Assign genres to their columns with reference from this link

In [139]:
df.head(3)
Out[139]:
popularity title director keywords runtime genres production_companies vote_count vote_average release_year ... Adventure Comedy TV Movie Family Thriller Romance Documentary Crime War Drama
0 32.985763 Jurassic World Colin Trevorrow monster|dna|tyrannosaurus rex|velociraptor|island 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 5562 6.5 2015 ... 1 0 0 0 1 0 0 0 0 0
1 28.419936 Mad Max: Fury Road George Miller future|chase|post-apocalyptic|dystopia|australia 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 6185 7.1 2015 ... 1 0 0 0 1 0 0 0 0 0
2 13.112507 Insurgent Robert Schwentke based on novel|revolution|dystopia|sequel|dyst... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 2480 6.3 2015 ... 1 0 0 0 1 0 0 0 0 0

3 rows × 32 columns

In [140]:
df[list(arr_genres)].sum().nlargest(10).plot(kind="bar", figsize=(10,6))
plt.xlabel("Genres")
plt.ylabel("Number of Movies")
plt.title("Top 10 Most Popular Genres Overtime");

The barchart shows that the most favorite genres that are found in the majority of movies in the dataset are drama, comedy, thriller, action, romance, horror, adventure, crime, science fiction and family.

Q2: What are time series of top 5 movie genres?

In [141]:
df.groupby("release_year")[["Drama", "Comedy", "Thriller", "Action", "Romance"]].sum().plot(figsize=(8,6))
plt.xlabel("Year")
plt.ylabel("Number of Movies")
plt.title("Time Series of Top 5 Movie Genres");

Trends of genres as seen in the time series above describe some interesting information:

  • Drama is consistently the most on-trend genre. Especially, since around 2003-2004, the number of movies of this genre rose remarkably
  • Followed by drama, comedy and thriller rank 2nd and 3rd respectively. Interestingly, thriller has gained more interest over comedy in recent years from about 2014
  • Action and romance always stand at the bottom of this top-5 list over time

Q3: What are factors associated with revenue levels?

In [142]:
df.describe().revenue
Out[142]:
count    8.692000e+03
mean     6.343420e+07
std      1.590868e+08
min      0.000000e+00
25%      0.000000e+00
50%      1.712360e+05
75%      5.493202e+07
max      2.827124e+09
Name: revenue, dtype: float64

View the min, 25%, 50%, 75%, max revenue values with Pandas describe

In [143]:
bin_edges = [0.000000e+00, 1.712360e+05, 5.493202e+07, 2.827124e+09] 

Use bin edges to cut the data into groups

In [144]:
bin_names = ['low', 'medium', 'high']

Label 3 revenue groups

In [145]:
df['revenue_levels'] = pd.cut(df['revenue'], bin_edges, labels=bin_names, include_lowest = True)
df.tail()
Out[145]:
popularity title director keywords runtime genres production_companies vote_count vote_average release_year ... Comedy TV Movie Family Thriller Romance Documentary Crime War Drama revenue_levels
10861 0.080598 The Endless Summer Bruce Brown surfer|surfboard|surfing 95 Documentary Bruce Brown Films 11 7.4 1966 ... 0 0 0 0 0 1 0 0 0 low
10862 0.065543 Grand Prix John Frankenheimer car race|racing|formula 1 176 Action|Adventure|Drama Cherokee Productions|Joel Productions|Douglas ... 20 5.7 1966 ... 0 0 0 0 0 0 0 0 1 low
10863 0.065141 Beregis Avtomobilya Eldar Ryazanov car|trolley|stealing car 94 Mystery|Comedy Mosfilm 11 6.5 1966 ... 1 0 0 0 0 0 0 0 0 low
10864 0.064317 What's Up, Tiger Lily? Woody Allen spoof 80 Action|Comedy Benedict Pictures Corp. 22 5.4 1966 ... 1 0 0 0 0 0 0 0 0 low
10865 0.035919 Manos: The Hands of Fate Harold P. Warren fire|gun|drive|sacrifice|flashlight 74 Horror Norm-Iris 15 1.5 1966 ... 0 0 0 0 0 0 0 0 0 low

5 rows × 33 columns

In [146]:
fig, axes = plt.subplots(1,3,figsize=(16,6))
labels = ["low", "medium", "high"]
axes[0].bar(labels, df.groupby("revenue_levels").vote_count.mean())
axes[0].title.set_text("Effects of Average Vote Counts on Revenue Levels")
axes[0].set_xlabel("revenue levels")
axes[0].set_ylabel("average of vote counts")

axes[1].bar(labels, df.groupby("revenue_levels").popularity.mean())
axes[1].title.set_text("Effects of Average Popularity on Revenue Levels")
axes[1].set_xlabel("revenue levels")
axes[1].set_ylabel("average of popularity")

axes[2].bar(labels, df.groupby("revenue_levels").budget.mean())
axes[2].title.set_text("Effects of Average Budget on Revenue Levels")
axes[2].set_xlabel("revenue levels")
axes[2].set_ylabel("average of budget");

There are 2 factors associated with revenue: average of vote counts and average of popularity. In both cases, the higher the factors, the higher revenue.

In [147]:
df.groupby("director")["revenue"].sum().sort_values(ascending=False).nlargest(10).plot(kind="bar")
plt.xlabel("directors")
plt.ylabel("revenue")
plt.title("Top 10 Directors with Highest Total Movie Revenues");
In [148]:
df.nlargest(100, columns="revenue" )["director"].value_counts()
Out[148]:
Steven Spielberg                7
Peter Jackson                   6
Michael Bay                     4
David Yates                     4
George Lucas                    4
                               ..
Garry Marshall                  1
Steve Martino|Mike Thurmeier    1
Mike Newell                     1
Ron Howard                      1
Andrew Adamson                  1
Name: director, Length: 64, dtype: int64

Director could be considered as another factor affecting revenues of movies

  • The barchart reveals that Steven Spielberg is director of movies having highest total revenues.
  • Using value_counts and nlargest, I find out that the majority of movies (7 movies) in top 100 having the highest revenues also belonged to Steven Spielberg.
  • Other names appear in both list are Peter jackson, George Lucas, Ron Howard and Michael Bay.

Conclusions

To sum up, the most popular genres overtime are drama, comedy, thriller, action, romance, horror, adventure, crime, science fiction and family. Tastes of movie genres may change slightly from year to year, but drama is still the all-time favorite genre.

When it comes to factors affecting movie revenues, averages of popularity, vote counts and budget present clearly positive correlations with revenue. Another factor of director also draws some attention and might be associated with revenue.