Project: Investigation of TMDb Movie Dataset¶

Table of Contents¶

Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions

Introduction¶

Welcome to my investigation of TMDb movie data set which contains information about 10,000 movies. In terms of the data set, it was cleaned from original data on Kaggle, and is about to get cleaned second time to focus and answer the main questions.

After data analysis steps of data wrangling, and EDA, I hope to provide some useful insights about the data set through answering these following questions with visualizations:

Question 1: Which genres are most popular overtime ?
Question 2: What are time series of top 5 movie genres?
Question 3: What are factors associated with revenue levels?

Data Wrangling¶

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Import statements of all packages that I use

df=pd.read_csv("movies.csv")
df.head()

df.shape

(10866, 21)

There are 10866 rows and 21 columns in total

df.describe()

Statistics summary I can extract to have an overall view of the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

The infor method describes dtypes of the dataset, and shows that there are missing values that I need to work on.

df.drop(labels = ["id", "imdb_id", "budget", "revenue", "cast", "homepage", "tagline", "overview", "release_date"], axis = 1, inplace = True)
df.head(2)

Drop unnecessary columns that have no use for my dataset investigation. There are 2 budget and 2 revenue columns in which budget and revenue with "adj" are recalculated in terms of 2010 dollars, accounting inflation overtime. Therefore, I keep such last 2 columns with "adj" which are more updated and drop the other 2 without "adj".

df.dropna(inplace = True)

All the missing values belong to columns with dtype of string, so I decide to drop any rows with such null values.

df.shape

(8692, 12)

The dataset is now downsized to 8692 rows and 12 columns

df.rename(columns={"original_title":"title", "budget_adj":"budget", "revenue_adj":"revenue"}, inplace = True)
df.head(1)

Rename 3 columns in which "budget_adj" and "revenue_adj" are changed to "budget" and "revenue" respectively for easy following because now there is only 1 column of budget and one of revenue in the dataset.

df.hist(figsize=(12,8));

Overview of some columns in the dataset, and as can be seen, most of models are right-skewed.

df.isnull().sum()

popularity              0
title                   0
director                0
keywords                0
runtime                 0
genres                  0
production_companies    0
vote_count              0
vote_average            0
release_year            0
budget                  0
revenue                 0
dtype: int64

Double check if there are any missing values

df.head(3)

The dataset is now clean as seen above.

Exploratory Data Analysis¶

Q1: Which genres are most popular overtime ?¶

arr_genres = set([])
for genre in df["genres"].unique():
    temp = genre.split("|")
    arr_genres.update(temp)

Get all unique genres using for loop and reference from this topic (set.update())

for name in list(arr_genres):
    df[name] = 0

Create and add a new column for each unique genre to the dataset

for index, row in df.iterrows():
    genres = row["genres"].split("|")
    for genre in genres:
        df.loc[index, genre] = 1

Assign genres to their columns with reference from this link

df.head(3)

df[list(arr_genres)].sum().nlargest(10).plot(kind="bar", figsize=(10,6))
plt.xlabel("Genres")
plt.ylabel("Number of Movies")
plt.title("Top 10 Most Popular Genres Overtime");

The barchart shows that the most favorite genres that are found in the majority of movies in the dataset are drama, comedy, thriller, action, romance, horror, adventure, crime, science fiction and family.

Q2: What are time series of top 5 movie genres?¶

df.groupby("release_year")[["Drama", "Comedy", "Thriller", "Action", "Romance"]].sum().plot(figsize=(8,6))
plt.xlabel("Year")
plt.ylabel("Number of Movies")
plt.title("Time Series of Top 5 Movie Genres");

Trends of genres as seen in the time series above describe some interesting information:

Drama is consistently the most on-trend genre. Especially, since around 2003-2004, the number of movies of this genre rose remarkably
Followed by drama, comedy and thriller rank 2nd and 3rd respectively. Interestingly, thriller has gained more interest over comedy in recent years from about 2014
Action and romance always stand at the bottom of this top-5 list over time

Q3: What are factors associated with revenue levels?¶

df.describe().revenue

count    8.692000e+03
mean     6.343420e+07
std      1.590868e+08
min      0.000000e+00
25%      0.000000e+00
50%      1.712360e+05
75%      5.493202e+07
max      2.827124e+09
Name: revenue, dtype: float64

View the min, 25%, 50%, 75%, max revenue values with Pandas describe

bin_edges = [0.000000e+00, 1.712360e+05, 5.493202e+07, 2.827124e+09]

Use bin edges to cut the data into groups

bin_names = ['low', 'medium', 'high']

Label 3 revenue groups

df['revenue_levels'] = pd.cut(df['revenue'], bin_edges, labels=bin_names, include_lowest = True)
df.tail()

fig, axes = plt.subplots(1,3,figsize=(16,6))
labels = ["low", "medium", "high"]
axes[0].bar(labels, df.groupby("revenue_levels").vote_count.mean())
axes[0].title.set_text("Effects of Average Vote Counts on Revenue Levels")
axes[0].set_xlabel("revenue levels")
axes[0].set_ylabel("average of vote counts")

axes[1].bar(labels, df.groupby("revenue_levels").popularity.mean())
axes[1].title.set_text("Effects of Average Popularity on Revenue Levels")
axes[1].set_xlabel("revenue levels")
axes[1].set_ylabel("average of popularity")

axes[2].bar(labels, df.groupby("revenue_levels").budget.mean())
axes[2].title.set_text("Effects of Average Budget on Revenue Levels")
axes[2].set_xlabel("revenue levels")
axes[2].set_ylabel("average of budget");

There are 2 factors associated with revenue: average of vote counts and average of popularity. In both cases, the higher the factors, the higher revenue.

df.groupby("director")["revenue"].sum().sort_values(ascending=False).nlargest(10).plot(kind="bar")
plt.xlabel("directors")
plt.ylabel("revenue")
plt.title("Top 10 Directors with Highest Total Movie Revenues");

df.nlargest(100, columns="revenue" )["director"].value_counts()

Steven Spielberg                7
Peter Jackson                   6
Michael Bay                     4
David Yates                     4
George Lucas                    4
                               ..
Garry Marshall                  1
Steve Martino|Mike Thurmeier    1
Mike Newell                     1
Ron Howard                      1
Andrew Adamson                  1
Name: director, Length: 64, dtype: int64

Director could be considered as another factor affecting revenues of movies

The barchart reveals that Steven Spielberg is director of movies having highest total revenues.
Using value_counts and nlargest, I find out that the majority of movies (7 movies) in top 100 having the highest revenues also belonged to Steven Spielberg.
Other names appear in both list are Peter jackson, George Lucas, Ron Howard and Michael Bay.

Conclusions¶

To sum up, the most popular genres overtime are drama, comedy, thriller, action, romance, horror, adventure, crime, science fiction and family. Tastes of movie genres may change slightly from year to year, but drama is still the all-time favorite genre.

When it comes to factors affecting movie revenues, averages of popularity, vote counts and budget present clearly positive correlations with revenue. Another factor of director also draws some attention and might be associated with revenue.

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/15	5562	6.5	2015	1.379999e+08	1.392446e+09
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	What a Lovely Day.	...	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	5/13/15	6185	7.1	2015	1.379999e+08	3.481613e+08
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	One Choice Can Destroy You	...	Beatrice Prior must confront her inner demons ...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	3/18/15	2480	6.3	2015	1.012000e+08	2.716190e+08
3	140607	tt2488496	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Every generation has a story.	...	Thirty years after defeating the Galactic Empi...	136	Action\|Adventure\|Science Fiction\|Fantasy	Lucasfilm\|Truenorth Productions\|Bad Robot	12/15/15	5292	7.5	2015	1.839999e+08	1.902723e+09
4	168259	tt2820852	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Vengeance Hits Home	...	Deckard Shaw seeks revenge against Dominic Tor...	137	Action\|Crime\|Thriller	Universal Pictures\|Original Film\|Media Rights ...	4/1/15	2947	7.3	2015	1.747999e+08	1.385749e+09

	id	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	10866.000000	10866.000000	1.086600e+04	1.086600e+04	10866.000000	10866.000000	10866.000000	10866.000000	1.086600e+04	1.086600e+04
mean	66064.177434	0.646441	1.462570e+07	3.982332e+07	102.070863	217.389748	5.974922	2001.322658	1.755104e+07	5.136436e+07
std	92130.136561	1.000185	3.091321e+07	1.170035e+08	31.381405	575.619058	0.935142	12.812941	3.430616e+07	1.446325e+08
min	5.000000	0.000065	0.000000e+00	0.000000e+00	0.000000	10.000000	1.500000	1960.000000	0.000000e+00	0.000000e+00
25%	10596.250000	0.207583	0.000000e+00	0.000000e+00	90.000000	17.000000	5.400000	1995.000000	0.000000e+00	0.000000e+00
50%	20669.000000	0.383856	0.000000e+00	0.000000e+00	99.000000	38.000000	6.000000	2006.000000	0.000000e+00	0.000000e+00
75%	75610.000000	0.713817	1.500000e+07	2.400000e+07	111.000000	145.750000	6.600000	2011.000000	2.085325e+07	3.369710e+07
max	417859.000000	32.985763	4.250000e+08	2.781506e+09	900.000000	9767.000000	9.200000	2015.000000	4.250000e+08	2.827124e+09

	popularity	original_title	director	keywords	runtime	genres	production_companies	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	32.985763	Jurassic World	Colin Trevorrow	monster\|dna\|tyrannosaurus rex\|velociraptor\|island	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	5562	6.5	2015	1.379999e+08	1.392446e+09
1	28.419936	Mad Max: Fury Road	George Miller	future\|chase\|post-apocalyptic\|dystopia\|australia	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	6185	7.1	2015	1.379999e+08	3.481613e+08

	popularity	title	director	keywords	runtime	genres	production_companies	vote_count	vote_average	release_year	budget	revenue
0	32.985763	Jurassic World	Colin Trevorrow	monster\|dna\|tyrannosaurus rex\|velociraptor\|island	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	5562	6.5	2015	1.379999e+08	1.392446e+09
1	28.419936	Mad Max: Fury Road	George Miller	future\|chase\|post-apocalyptic\|dystopia\|australia	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	6185	7.1	2015	1.379999e+08	3.481613e+08
2	13.112507	Insurgent	Robert Schwentke	based on novel\|revolution\|dystopia\|sequel\|dyst...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	2480	6.3	2015	1.012000e+08	2.716190e+08

	popularity	title	director	keywords	runtime	genres	production_companies	vote_count	vote_average	release_year	...	Adventure	Thriller
0	32.985763	Jurassic World	Colin Trevorrow	monster\|dna\|tyrannosaurus rex\|velociraptor\|island	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	5562	6.5	2015	...	1	1
1	28.419936	Mad Max: Fury Road	George Miller	future\|chase\|post-apocalyptic\|dystopia\|australia	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	6185	7.1	2015	...	1	1
2	13.112507	Insurgent	Robert Schwentke	based on novel\|revolution\|dystopia\|sequel\|dyst...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	2480	6.3	2015	...	1	1

	popularity	title	director	keywords	runtime	genres	production_companies	vote_count	vote_average	release_year	...	Comedy	Documentary	Drama	revenue_levels
10861	0.080598	The Endless Summer	Bruce Brown	surfer\|surfboard\|surfing	95	Documentary	Bruce Brown Films	11	7.4	1966	...	0	1	0	low
10862	0.065543	Grand Prix	John Frankenheimer	car race\|racing\|formula 1	176	Action\|Adventure\|Drama	Cherokee Productions\|Joel Productions\|Douglas ...	20	5.7	1966	...	0	0	1	low
10863	0.065141	Beregis Avtomobilya	Eldar Ryazanov	car\|trolley\|stealing car	94	Mystery\|Comedy	Mosfilm	11	6.5	1966	...	1	0	0	low
10864	0.064317	What's Up, Tiger Lily?	Woody Allen	spoof	80	Action\|Comedy	Benedict Pictures Corp.	22	5.4	1966	...	1	0	0	low
10865	0.035919	Manos: The Hands of Fate	Harold P. Warren	fire\|gun\|drive\|sacrifice\|flashlight	74	Horror	Norm-Iris	15	1.5	1966	...	0	0	0	low