Welcome to my investigation of the TMDb movie data set, which contains information about roughly 10,000 movies. The data set was already cleaned once from the original data on Kaggle, and I clean it a second time here so that it focuses on the main questions.
After the data wrangling and EDA steps, I hope to provide some useful insights about the data set by answering the following questions with visualizations:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
Import statements for all the packages that I use
df=pd.read_csv("movies.csv")
df.head()
df.shape
There are 10866 rows and 21 columns in total
df.describe()
Summary statistics that give an overall view of the numeric columns in the dataset
df.info()
The info method lists the dtype of each column and shows that there are missing values that I need to work on.
df.drop(labels = ["id", "imdb_id", "budget", "revenue", "cast", "homepage", "tagline", "overview", "release_date"], axis = 1, inplace = True)
df.head(2)
Drop the columns that are not needed for my investigation. There are 2 budget and 2 revenue columns; the ones with the "adj" suffix are recalculated in terms of 2010 dollars, accounting for inflation over time. Therefore, I keep the 2 "adj" columns, which are comparable across years, and drop the 2 without "adj".
df.dropna(inplace = True)
All the missing values are in string-typed columns, where there is no sensible value to fill in, so I decide to drop every row that contains a null value.
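As a minimal sketch of the check behind this step, the snippet below finds which columns contain nulls and then drops the affected rows. The `toy` frame and its values are made up for illustration and are not taken from the movie data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the movie data
toy = pd.DataFrame({
    "director": ["A", None, "C"],
    "genres": ["Drama", "Comedy", None],
    "runtime": [90, 100, 110],
})

# Count missing values per column and keep only the columns that have any
null_counts = toy.isnull().sum()
cols_with_nulls = list(null_counts[null_counts > 0].index)

# Inspect the dtypes of those columns before deciding how to handle them
print(toy[cols_with_nulls].dtypes)

# Drop every row that has at least one missing value
cleaned = toy.dropna()
print(cleaned.shape)
```

Returning a new frame from `dropna()` instead of using `inplace=True` keeps the original around, which can be handy for comparing row counts before and after.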
df.shape
The dataset is now downsized to 8692 rows and 12 columns
df.rename(columns={"original_title":"title", "budget_adj":"budget", "revenue_adj":"revenue"}, inplace = True)
df.head(1)
Rename 3 columns: "original_title" becomes "title", and "budget_adj" and "revenue_adj" become "budget" and "revenue" respectively. The shorter names are easier to follow now that there is only one budget column and one revenue column in the dataset.
df.hist(figsize=(12,8));
Overview of the numeric columns in the dataset. As can be seen, most of the distributions are right-skewed.
df.isnull().sum()
Double check if there are any missing values
df.head(3)
The dataset is now clean as seen above.
arr_genres = set()
for genre in df["genres"].unique():
    temp = genre.split("|")
    arr_genres.update(temp)
Get all unique genres with a for loop, splitting each "|"-separated string and collecting the pieces with set.update()
for name in list(arr_genres):
    df[name] = 0
Create and add a new column for each unique genre to the dataset
for index, row in df.iterrows():
    genres = row["genres"].split("|")
    for genre in genres:
        df.loc[index, genre] = 1
Set each movie's genre columns to 1 by iterating over the rows with iterrows() and assigning with df.loc
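The three steps above (collect the unique genres, add a zero column per genre, fill in the ones) amount to building indicator columns. As a possible alternative, not the approach used in this notebook, pandas can build the same kind of columns in one call with `Series.str.get_dummies`. The `toy` frame below is a made-up example.

```python
import pandas as pd

# Hypothetical toy frame with "|"-separated genres, as in the movie data
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Drama|Romance", "Comedy", "Action|Drama"],
})

# One indicator column per genre: 1 if the movie has that genre, else 0
indicators = toy["genres"].str.get_dummies(sep="|")

# Attach the indicator columns to the original frame
toy = toy.join(indicators)

# Number of movies per genre, largest first
genre_counts = indicators.sum().sort_values(ascending=False)
print(genre_counts)
```

On a frame of this size the row-by-row loop is fine; the vectorized call mainly helps when the loop becomes slow.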
df.head(3)
df[list(arr_genres)].sum().nlargest(10).plot(kind="bar", figsize=(10,6))
plt.xlabel("Genres")
plt.ylabel("Number of Movies")
plt.title("Top 10 Most Popular Genres Over Time");
The bar chart shows that the genres appearing in the largest number of movies in the dataset are drama, comedy, thriller, action, romance, horror, adventure, crime, science fiction and family.
df.groupby("release_year")[["Drama", "Comedy", "Thriller", "Action", "Romance"]].sum().plot(figsize=(8,6))
plt.xlabel("Year")
plt.ylabel("Number of Movies")
plt.title("Time Series of Top 5 Movie Genres");
Trends of genres as seen in the time series above describe some interesting information:
df.describe().revenue
View the min, 25%, 50%, 75% and max revenue values with the pandas describe method
bin_edges = [0.000000e+00, 1.712360e+05, 5.493202e+07, 2.827124e+09]
Bin edges, taken from the summary above, that are used to cut the data into groups
bin_names = ['low', 'medium', 'high']
Labels for the 3 revenue groups
df['revenue_levels'] = pd.cut(df['revenue'], bin_edges, labels=bin_names, include_lowest = True)
df.tail()
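The bin edges above are copied by hand from the describe output. A minimal sketch of deriving them in code instead is shown below, on a made-up `revenue` series. Which quantiles to use as edges is an assumption here (min, median, 75th percentile, max), not something taken from the notebook.

```python
import pandas as pd

# Hypothetical revenue values for illustration only
revenue = pd.Series([0, 0, 10, 20, 50, 80, 120, 300], name="revenue")

# Assumed edges: min, median, 75th percentile, max
edges = revenue.quantile([0, 0.5, 0.75, 1.0]).tolist()

# Cut into labelled groups; include_lowest keeps the minimum in the first bin
levels = pd.cut(revenue, bins=edges, labels=["low", "medium", "high"],
                include_lowest=True)

print(levels.value_counts())
```

If two quantiles coincide (for example when many revenues are 0), `pd.cut` raises an error on the repeated edge unless `duplicates="drop"` is passed, in which case the number of labels has to be reduced to match.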
fig, axes = plt.subplots(1,3,figsize=(16,6))
labels = ["low", "medium", "high"]
axes[0].bar(labels, df.groupby("revenue_levels").vote_count.mean())
axes[0].title.set_text("Effects of Average Vote Counts on Revenue Levels")
axes[0].set_xlabel("revenue levels")
axes[0].set_ylabel("average of vote counts")
axes[1].bar(labels, df.groupby("revenue_levels").popularity.mean())
axes[1].title.set_text("Effects of Average Popularity on Revenue Levels")
axes[1].set_xlabel("revenue levels")
axes[1].set_ylabel("average of popularity")
axes[2].bar(labels, df.groupby("revenue_levels").budget.mean())
axes[2].title.set_text("Effects of Average Budget on Revenue Levels")
axes[2].set_xlabel("revenue levels")
axes[2].set_ylabel("average of budget");
Two factors are associated with revenue: average vote count and average popularity. In both cases, the higher the factor, the higher the revenue level.
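The bar charts compare group means across revenue levels. A rank correlation is one complementary way to check the direction of such an association; the sketch below computes it as a Pearson correlation on ranks, which is the Spearman coefficient. The `toy` values are invented for illustration and say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the cleaned dataset
toy = pd.DataFrame({
    "revenue": [0, 10, 50, 200, 900],
    "vote_count": [5, 20, 15, 300, 1200],
    "popularity": [0.1, 0.4, 0.3, 1.5, 4.0],
    "budget": [1, 8, 30, 90, 150],
})

# Pearson correlation of the ranks equals the Spearman rank correlation
rank_corr = toy.rank().corr()

# Keep only each factor's correlation with revenue
corr_with_revenue = rank_corr["revenue"].drop("revenue")
print(corr_with_revenue)
```

A coefficient near 1 would point to a monotonic positive association, though it would not by itself show that one variable drives the other.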
df.groupby("director")["revenue"].sum().nlargest(10).plot(kind="bar")
plt.xlabel("directors")
plt.ylabel("revenue")
plt.title("Top 10 Directors with Highest Total Movie Revenues");
df.nlargest(100, columns="revenue" )["director"].value_counts()
The director could be considered another factor associated with movie revenue.
To sum up, the most popular genres over time, measured by number of movies, are drama, comedy, thriller, action, romance, horror, adventure, crime, science fiction and family. Tastes in movie genres may change slightly from year to year, but drama remains the all-time favorite genre.
When it comes to factors affecting movie revenue, average popularity, vote count and budget all rise with revenue level, which suggests a positive association with revenue. The director also draws some attention and might be associated with revenue.