Simple Visualization on Goodreads-books

Author: Zhanglin Liu

Date: September 05, 2020

1 Background

The dataset used in this project is a Kaggle open dataset named Goodreads-books. It provides a comprehensive list of all books listed in goodreads. This dataset was last updated on March 09, 2020.

We are trying to answer the following questions:

  • Which top 5 books have the highest average rating and from which publisher?
  • How many books were published since 1990?

2 Data Preparation

There are 4 lines with unmatched column size, to successfully load using pandas, these 4 lines are skipped.

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from matplotlib import pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv('C:/Users/Elle/Desktop/jupyter_folder/datasets/books.csv', error_bad_lines = False)
df.head()
b'Skipping line 3350: expected 12 fields, saw 13\nSkipping line 4704: expected 12 fields, saw 13\nSkipping line 5879: expected 12 fields, saw 13\nSkipping line 8981: expected 12 fields, saw 13\n'
Out[1]:
bookID title authors average_rating isbn isbn13 language_code num_pages ratings_count text_reviews_count publication_date publisher
0 1 Harry Potter and the Half-Blood Prince (Harry ... J.K. Rowling/Mary GrandPré 4.57 0439785960 9780439785969 eng 652 2095690 27591 9/16/2006 Scholastic Inc.
1 2 Harry Potter and the Order of the Phoenix (Har... J.K. Rowling/Mary GrandPré 4.49 0439358078 9780439358071 eng 870 2153167 29221 9/1/2004 Scholastic Inc.
2 4 Harry Potter and the Chamber of Secrets (Harry... J.K. Rowling 4.42 0439554896 9780439554893 eng 352 6333 244 11/1/2003 Scholastic
3 5 Harry Potter and the Prisoner of Azkaban (Harr... J.K. Rowling/Mary GrandPré 4.56 043965548X 9780439655484 eng 435 2339585 36325 5/1/2004 Scholastic Inc.
4 8 Harry Potter Boxed Set Books 1-5 (Harry Potte... J.K. Rowling/Mary GrandPré 4.78 0439682584 9780439682589 eng 2690 41428 164 9/13/2004 Scholastic

2.1 The Dataset

The following provides a basic information on the interested dataset. Below shows that there are 11,123 entries.

In [2]:
df['bookID'].nunique()
Out[2]:
11123
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11123 entries, 0 to 11122
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   bookID              11123 non-null  int64  
 1   title               11123 non-null  object 
 2   authors             11123 non-null  object 
 3   average_rating      11123 non-null  float64
 4   isbn                11123 non-null  object 
 5   isbn13              11123 non-null  int64  
 6   language_code       11123 non-null  object 
 7     num_pages         11123 non-null  int64  
 8   ratings_count       11123 non-null  int64  
 9   text_reviews_count  11123 non-null  int64  
 10  publication_date    11123 non-null  object 
 11  publisher           11123 non-null  object 
dtypes: float64(1), int64(5), object(6)
memory usage: 1.0+ MB

2.2 Data Cleaning

Notice the data type of 'publication_date' column is type object, since we are going to work with dates, it is better to convert the data type to datetime format.

In [4]:
df['publication_date'] = pd.to_datetime(df['publication_date'], format='%m/%d/%Y', errors='coerce')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11123 entries, 0 to 11122
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   bookID              11123 non-null  int64         
 1   title               11123 non-null  object        
 2   authors             11123 non-null  object        
 3   average_rating      11123 non-null  float64       
 4   isbn                11123 non-null  object        
 5   isbn13              11123 non-null  int64         
 6   language_code       11123 non-null  object        
 7     num_pages         11123 non-null  int64         
 8   ratings_count       11123 non-null  int64         
 9   text_reviews_count  11123 non-null  int64         
 10  publication_date    11121 non-null  datetime64[ns]
 11  publisher           11123 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(5), object(5)
memory usage: 1.0+ MB

3 Simple Statistics

3.1 Top 5 books with the highest average rating and from which publisher?

To taking into account that some books may have a few rating counts but high reivews, we will be only considering books with at least 50 reviews.

In [5]:
tempDf = df.loc[df['ratings_count'] >= 50]
hi_avg_rating = tempDf.groupby(['title','publisher'])['average_rating'].max().sort_values(ascending=False).head(5).reset_index()
hi_avg_rating
Out[5]:
title publisher average_rating
0 The Complete Calvin and Hobbes Andrews McMeel Publishing 4.82
1 Harry Potter Boxed Set Books 1-5 (Harry Potte... Scholastic 4.78
2 It's a Magical World (Calvin and Hobbes #11) Andrews McMeel Publishing 4.76
3 Early Color Steidl 4.73
4 Harry Potter Collection (Harry Potter #1-6) Scholastic 4.73
In [6]:
fig, ax = plt.subplots(figsize = (5,3))
ax.set_title('Top 5 books with their ratings', fontsize = 12, weight = 'bold')
ax.set_ylabel('Book Title')
ax.set_xlabel('Average ratings')
ax.barh(hi_avg_rating['title'],hi_avg_rating['average_rating'])
for i in ['top','bottom','left','right']:
    ax.spines[i].set_visible(False)
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
ax.grid(False)
ax.set_facecolor("white")
ax.invert_yaxis()
for j in ax.patches:
    plt.text(j.get_width()+0.2, j.get_y()+0.5, 
             str(round((j.get_width()),2)), fontsize = 10, color = 'black')
plt.show()

3.2 How many books were published since 1990?

As shown below, 10,168 unique books were published since year 1990. A 'publication_year' column is added to provide a better visualization chart.

In [7]:
df_pub = df.loc[df['publication_date'] >= '01/01/1990']
df_pub['publication_year'] = pd.DatetimeIndex(df_pub['publication_date']).year
df_pub.shape
Out[7]:
(10168, 13)
In [8]:
df_pub_temp = df_pub.groupby(['publication_year'])['bookID'].count().reset_index()
df_pub_temp
Out[8]:
publication_year bookID
0 1990 117
1 1991 151
2 1992 183
3 1993 165
4 1994 220
5 1995 249
6 1996 250
7 1997 290
8 1998 396
9 1999 450
10 2000 533
11 2001 656
12 2002 798
13 2003 931
14 2004 1069
15 2005 1260
16 2006 1700
17 2007 518
18 2008 49
19 2009 42
20 2010 37
21 2011 24
22 2012 21
23 2013 15
24 2014 9
25 2015 10
26 2016 6
27 2017 7
28 2018 5
29 2019 6
30 2020 1
In [9]:
df_pub_temp.describe()
Out[9]:
publication_year bookID
count 31.000000 31.000000
mean 2005.000000 328.000000
std 9.092121 427.870541
min 1990.000000 1.000000
25% 1997.500000 18.000000
50% 2005.000000 165.000000
75% 2012.500000 484.000000
max 2020.000000 1700.000000
In [10]:
fig_pub, ax_pub = plt.subplots()
fig_pub = plt.figure(figsize = (5,3))
ax_pub.set_title('Number of published books since 1990 ', fontsize = 12, weight = 'bold')
ax_pub.set_xlabel('Publication Year')
ax_pub.set_ylabel('Number of Books')
ax_pub.grid(False)
ax_pub.set_facecolor("white")
ax_pub.bar(df_pub_temp['publication_year'],df_pub_temp['bookID'])
plt.show()
<Figure size 360x216 with 0 Axes>

One interesting observation from the above bar chart is that the rate at which the books were published was at an increasing slope between 1990 to 2006. At around 2007, the slope started to decline.