Simple Visualization on Goodreads-books¶

Author: Zhanglin Liu

Date: September 05, 2020

1 Background¶

The dataset used in this project is a Kaggle open dataset named Goodreads-books. It provides a comprehensive list of all books listed in goodreads. This dataset was last updated on March 09, 2020.

We are trying to answer the following questions:

Which top 5 books have the highest average rating and from which publisher?
How many books were published since 1990?

2 Data Preparation¶

There are 4 lines with unmatched column size, to successfully load using pandas, these 4 lines are skipped.

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from matplotlib import pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv('C:/Users/Elle/Desktop/jupyter_folder/datasets/books.csv', error_bad_lines = False)
df.head()

b'Skipping line 3350: expected 12 fields, saw 13\nSkipping line 4704: expected 12 fields, saw 13\nSkipping line 5879: expected 12 fields, saw 13\nSkipping line 8981: expected 12 fields, saw 13\n'

2.1 The Dataset¶

The following provides a basic information on the interested dataset. Below shows that there are 11,123 entries.

df['bookID'].nunique()

11123

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11123 entries, 0 to 11122
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   bookID              11123 non-null  int64  
 1   title               11123 non-null  object 
 2   authors             11123 non-null  object 
 3   average_rating      11123 non-null  float64
 4   isbn                11123 non-null  object 
 5   isbn13              11123 non-null  int64  
 6   language_code       11123 non-null  object 
 7     num_pages         11123 non-null  int64  
 8   ratings_count       11123 non-null  int64  
 9   text_reviews_count  11123 non-null  int64  
 10  publication_date    11123 non-null  object 
 11  publisher           11123 non-null  object 
dtypes: float64(1), int64(5), object(6)
memory usage: 1.0+ MB

2.2 Data Cleaning¶

Notice the data type of 'publication_date' column is type object, since we are going to work with dates, it is better to convert the data type to datetime format.

df['publication_date'] = pd.to_datetime(df['publication_date'], format='%m/%d/%Y', errors='coerce')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11123 entries, 0 to 11122
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   bookID              11123 non-null  int64         
 1   title               11123 non-null  object        
 2   authors             11123 non-null  object        
 3   average_rating      11123 non-null  float64       
 4   isbn                11123 non-null  object        
 5   isbn13              11123 non-null  int64         
 6   language_code       11123 non-null  object        
 7     num_pages         11123 non-null  int64         
 8   ratings_count       11123 non-null  int64         
 9   text_reviews_count  11123 non-null  int64         
 10  publication_date    11121 non-null  datetime64[ns]
 11  publisher           11123 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(5), object(5)
memory usage: 1.0+ MB

3 Simple Statistics¶

3.1 Top 5 books with the highest average rating and from which publisher?¶

To taking into account that some books may have a few rating counts but high reivews, we will be only considering books with at least 50 reviews.

tempDf = df.loc[df['ratings_count'] >= 50]
hi_avg_rating = tempDf.groupby(['title','publisher'])['average_rating'].max().sort_values(ascending=False).head(5).reset_index()
hi_avg_rating

fig, ax = plt.subplots(figsize = (5,3))
ax.set_title('Top 5 books with their ratings', fontsize = 12, weight = 'bold')
ax.set_ylabel('Book Title')
ax.set_xlabel('Average ratings')
ax.barh(hi_avg_rating['title'],hi_avg_rating['average_rating'])
for i in ['top','bottom','left','right']:
    ax.spines[i].set_visible(False)
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
ax.grid(False)
ax.set_facecolor("white")
ax.invert_yaxis()
for j in ax.patches:
    plt.text(j.get_width()+0.2, j.get_y()+0.5, 
             str(round((j.get_width()),2)), fontsize = 10, color = 'black')
plt.show()

3.2 How many books were published since 1990?¶

As shown below, 10,168 unique books were published since year 1990. A 'publication_year' column is added to provide a better visualization chart.

df_pub = df.loc[df['publication_date'] >= '01/01/1990']
df_pub['publication_year'] = pd.DatetimeIndex(df_pub['publication_date']).year
df_pub.shape

(10168, 13)

df_pub_temp = df_pub.groupby(['publication_year'])['bookID'].count().reset_index()
df_pub_temp

df_pub_temp.describe()

fig_pub, ax_pub = plt.subplots()
fig_pub = plt.figure(figsize = (5,3))
ax_pub.set_title('Number of published books since 1990 ', fontsize = 12, weight = 'bold')
ax_pub.set_xlabel('Publication Year')
ax_pub.set_ylabel('Number of Books')
ax_pub.grid(False)
ax_pub.set_facecolor("white")
ax_pub.bar(df_pub_temp['publication_year'],df_pub_temp['bookID'])
plt.show()

<Figure size 360x216 with 0 Axes>

One interesting observation from the above bar chart is that the rate at which the books were published was at an increasing slope between 1990 to 2006. At around 2007, the slope started to decline.

	bookID	title	authors	average_rating	isbn	isbn13	language_code	num_pages	ratings_count	text_reviews_count	publication_date	publisher
0	1	Harry Potter and the Half-Blood Prince (Harry ...	J.K. Rowling/Mary GrandPré	4.57	0439785960	9780439785969	eng	652	2095690	27591	9/16/2006	Scholastic Inc.
1	2	Harry Potter and the Order of the Phoenix (Har...	J.K. Rowling/Mary GrandPré	4.49	0439358078	9780439358071	eng	870	2153167	29221	9/1/2004	Scholastic Inc.
2	4	Harry Potter and the Chamber of Secrets (Harry...	J.K. Rowling	4.42	0439554896	9780439554893	eng	352	6333	244	11/1/2003	Scholastic
3	5	Harry Potter and the Prisoner of Azkaban (Harr...	J.K. Rowling/Mary GrandPré	4.56	043965548X	9780439655484	eng	435	2339585	36325	5/1/2004	Scholastic Inc.
4	8	Harry Potter Boxed Set Books 1-5 (Harry Potte...	J.K. Rowling/Mary GrandPré	4.78	0439682584	9780439682589	eng	2690	41428	164	9/13/2004	Scholastic

	title	publisher	average_rating
0	The Complete Calvin and Hobbes	Andrews McMeel Publishing	4.82
1	Harry Potter Boxed Set Books 1-5 (Harry Potte...	Scholastic	4.78
2	It's a Magical World (Calvin and Hobbes #11)	Andrews McMeel Publishing	4.76
3	Early Color	Steidl	4.73
4	Harry Potter Collection (Harry Potter #1-6)	Scholastic	4.73

	publication_year	bookID
0	1990	117
1	1991	151
2	1992	183
3	1993	165
4	1994	220
5	1995	249
6	1996	250
7	1997	290
8	1998	396
9	1999	450
10	2000	533
11	2001	656
12	2002	798
13	2003	931
14	2004	1069
15	2005	1260
16	2006	1700
17	2007	518
18	2008	49
19	2009	42
20	2010	37
21	2011	24
22	2012	21
23	2013	15
24	2014	9
25	2015	10
26	2016	6
27	2017	7
28	2018	5
29	2019	6
30	2020	1

	publication_year	bookID
count	31.000000	31.000000
mean	2005.000000	328.000000
std	9.092121	427.870541
min	1990.000000	1.000000
25%	1997.500000	18.000000
50%	2005.000000	165.000000
75%	2012.500000	484.000000
max	2020.000000	1700.000000