Capstone Project on the Home Entertainment Dataset

Author: Zhanglin Liu

Date: 12/19/2020

Background

This is the capstone project for the Python Data Products for Predictive Analytics Specialization.

The project includes four tasks:

  1. Data Processing
  2. Classification
  3. Regression
  4. Recommender Systems

Task 1: Data Processing

The Data

The dataset of interest is the Amazon Customer Reviews Dataset on Home Entertainment Items, which can be found on the Amazon Customer Reviews Library site.

Below is the data dictionary for this dataset:

  • marketplace: 2 letter country code of the marketplace where the review was written
  • customer_id: Random identifier that can be used to aggregate reviews written by a single author
  • review_id: The unique ID of the review
  • product_id: The unique Product ID the review pertains to. In the multilingual dataset the reviews for the same product in different countries can be grouped by the same product_id
  • product_parent: Random identifier that can be used to aggregate reviews for the same product
  • product_title: Title of the product
  • product_category: Broad product category that can be used to group reviews
  • star_rating: The 1-5 star rating of the review
  • helpful_votes: Number of helpful votes
  • total_votes: Number of total votes the review received
  • vine: Review was written as part of the Vine program
  • verified_purchase: The review is on a verified purchase
  • review_headline: The title of the review
  • review_body: The review text
  • review_date: The date the review was written

Data Imports

In [1]:
import gzip
from collections import defaultdict
import random
import numpy 
import scipy.optimize
import string
from sklearn import linear_model
from nltk.stem.porter import PorterStemmer # Stemming

Read the data and fill the dataset

Cast the vote counts and star rating to integers, and convert the verified_purchase column from string to Boolean.

In [2]:
path = "amazon_reviews_us_Home_Entertainment_v1_00.tsv.gz"
f = gzip.open(path, 'rt', encoding = "utf8")
header = f.readline()
header = header.strip().split('\t')

dataset = []

for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    d['verified_purchase'] = d['verified_purchase']=='Y'
    dataset.append(d)
In [3]:
# below shows what a typical entry would look like
dataset[0]
Out[3]:
{'marketplace': 'US',
 'customer_id': '179886',
 'review_id': 'RY01SAV7HZ8QO',
 'product_id': 'B00NTI0CQ2',
 'product_parent': '667358431',
 'product_title': 'Aketek 1080P LED Protable Projector HD PC AV VGA USB HDMI(Black)',
 'product_category': 'Home Entertainment',
 'star_rating': 4,
 'helpful_votes': 0,
 'total_votes': 0,
 'vine': 'N',
 'verified_purchase': True,
 'review_headline': 'good enough for my purpose',
 'review_body': 'not the best picture quality but surely suitable for random movie nights in open areas with sheesha.',
 'review_date': '2015-08-31'}
In [4]:
# shuffling data
random.shuffle(dataset)
dataset[0]
Out[4]:
{'marketplace': 'US',
 'customer_id': '46900465',
 'review_id': 'R3EMF65AN108CA',
 'product_id': 'B001RCTAT2',
 'product_parent': '140372653',
 'product_title': 'Sharp LC52E77U 52-Inch 1080p 120Hz LCD HDTV, Black',
 'product_category': 'Home Entertainment',
 'star_rating': 5,
 'helpful_votes': 0,
 'total_votes': 0,
 'vine': 'N',
 'verified_purchase': True,
 'review_headline': 'Still going strong!',
 'review_body': "I've had my mishaps with it (mainboard fried), but the pixels are still intact and image quality is great! Not bad for.. I lost track of how long I've owned it. 7 years?! I think. If you come across a used unit for cheap, grab it immediately!",
 'review_date': '2014-04-25'}

Split the data into a Training and Testing set

Training is the first 80% of the shuffled data; testing is the remaining 20%.

In [5]:
N = len(dataset)
trainingSet = dataset[:4*N//5]
testSet = dataset[4*N//5:]

print(len(trainingSet), len(testSet))
564711 141178

Extracting Basic Statistics

Based on the Training Set, below are some questions we can answer:

  1. What is the average rating?
  2. What percentage of reviews are from verified purchases?
  3. How many total users are there?
  4. How many total items are there?
  5. What percentage of reviews have 5-star ratings?
In [6]:
# functions
def average_rating(dataset):
    # average of the 'star_rating' values over all reviews
    total = sum(d['star_rating'] for d in dataset)
    return total / len(dataset)

def verified_purchases_ct(dataset):
    # percentage of reviews that are verified purchases
    count = sum(1 for d in dataset if d['verified_purchase'])
    return round(count / len(dataset) * 100, 2)

def total_usersOrItems(dataset, id_string):
    # number of distinct values in the given id column
    # (e.g. 'customer_id' for users, 'product_id' for items)
    return len(set(d[id_string] for d in dataset))

def five_star_ct(dataset):
    # percentage of reviews with a 5-star rating
    five_star = sum(1 for d in dataset if d['star_rating'] == 5)
    return round(five_star / len(dataset) * 100, 2)
In [7]:
print("1. ", average_rating(trainingSet), 
      "\n2. ", verified_purchases_ct(trainingSet),"%"
     "\n3. ", total_usersOrItems(trainingSet, 'customer_id'),
     "\n4. ", total_usersOrItems(trainingSet, 'product_id'),
     "\n5. ", five_star_ct(trainingSet),"%")
1.  3.901932138739993 
2.  74.21 %
3.  499496 
4.  40058 
5.  52.99 %

Task 2: Classification

Extract simple features from each review, then use a Logistic Regression model to predict whether the review comes from a verified purchase.

Define the feature function

This implementation uses two features: the star rating and the length, in characters, of the review body.

In [8]:
def feat_eng(dataset):
    # review length = number of characters in the review body
    # (note: punctuation and whitespace are included)
    for d in dataset:
        d['len_review'] = len(d['review_body'])
    return dataset
In [9]:
# Return a list of feature vectors: [offset, star rating, review length]
def feature_vector(data):
    features = []
    for d in data:
        star_rating = d['star_rating']
        len_review = d['len_review']
        features.append([1, star_rating, len_review])
    return features
In [10]:
trainingSet = feat_eng(trainingSet)
In [11]:
testSet = feat_eng(testSet)

Predictive Model

Fit the model

  1. Get the features.
  2. Create the label vector from the verified_purchase column of the training set.
  3. Define the model as a Logistic Regression model.
  4. Fit the model.
In [12]:
features_train = feature_vector(trainingSet)
features_test = feature_vector(testSet)
In [13]:
label_train = [d['verified_purchase'] for d in trainingSet]
label_test = [d['verified_purchase'] for d in testSet]
In [14]:
model = linear_model.LogisticRegression()
model.fit(features_train, label_train)
print(model.score(features_train,label_train))
0.7438176341526905

Compute Accuracy of the Model

  1. Make Predictions based on the model.
  2. Compute the Accuracy of the model.
In [15]:
label_pred_train = model.predict(features_train)
label_pred_test = model.predict(features_test)
In [16]:
correct_train = label_pred_train == label_train
accuracy_train = sum(correct_train)/len(correct_train)

correct = label_pred_test == label_test
accuracy = sum(correct)/len(correct)
print("Training accuracy of the model = ", accuracy_train)
print("Testing accuracy of the model = ", accuracy)
Training accuracy of the model =  0.7438176341526905
Testing accuracy of the model =  0.7446840159231608
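
As a quick cross-check (a sketch, not part of the original pipeline), scikit-learn's accuracy_score computes the same fraction of correct predictions as the manual comparison above:

from sklearn.metrics import accuracy_score

# same fraction-correct as the manual comparison above;
# the label_* variables come from the preceding cells
print("Training accuracy (sklearn) = ", accuracy_score(label_train, label_pred_train))
print("Testing accuracy (sklearn) = ", accuracy_score(label_test, label_pred_test))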

Finding the Balanced Error Rate

  1. Compute True and False Positives
  2. Compute True and False Negatives
  3. Compute Balanced Error Rate based on above defined variables.
In [17]:
TP = sum([(p and l) for (p,l) in zip(label_pred_test, label_test)])
FP = sum([(p and not l) for (p,l) in zip(label_pred_test, label_test)])
TN = sum([(not p and not l) for (p,l) in zip(label_pred_test, label_test)])
FN = sum([(not p and l) for (p,l) in zip(label_pred_test, label_test)])
print("TP = " + str(TP))
print("FP = " + str(FP))
print("TN = " + str(TN))
print("FN = " + str(FN))
BER = 0.5*(FP/(TN+FP) + FN/(FN+TP))
print("Balanced Error Rate = " + str(BER))
TP = 102494
FP = 33689
TN = 2639
FN = 2356
Balanced Error Rate = 0.4749132523502026
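
A BER of roughly 0.475 is only slightly better than the 0.5 a random classifier would achieve: the small TN count shows the model labels almost every review as a verified purchase, so the plain accuracy above is flattering. As a sanity check, the same quantity can be computed with scikit-learn, since balanced accuracy is 1 - BER (a sketch using the variables from the cells above):

from sklearn.metrics import balanced_accuracy_score

# balanced accuracy = 0.5*(TPR + TNR), so BER = 1 - balanced accuracy
print("Balanced Error Rate (sklearn) =", 1 - balanced_accuracy_score(label_test, label_pred_test))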

Task 3: Regression

Unique Words in a Sample Set

We will work with a smaller sample set here, as stemming the full training set would take a very long time.

  1. Count the number of unique words found within the 'review_body' portion of the sample set defined below, ignoring punctuation and capitalization.
  2. Count the number of unique words again, this time with stemming applied (still ignoring punctuation and capitalization); see the short illustration after this list.
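
For a sense of why stemming shrinks the vocabulary, the Porter stemmer maps inflected forms of a word onto a common stem (a small illustration; the words are just examples):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# all four forms collapse onto the single stem 'connect'
print([stemmer.stem(w) for w in ['connect', 'connected', 'connection', 'connections']])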
In [18]:
wordCount = defaultdict(int)
punctuation = set(string.punctuation)

wordCountStem = defaultdict(int)
stemmer = PorterStemmer() #use stemmer.stem(stuff)
In [19]:
sampleSet = trainingSet[:2*len(trainingSet)//10]
len(sampleSet)
Out[19]:
112942
In [20]:
def word_ct(dataset, mode):
    # counts words ('reg') or word stems ('stem') across all review bodies,
    # lowercasing and stripping punctuation first
    if mode == 'reg':
        for d in dataset:
            r = "".join([c for c in d["review_body"].lower() if not c in punctuation])
            for w in r.split():
                wordCount[w] += 1
        return wordCount
    elif mode == 'stem':
        for d in dataset:
            r = "".join([c for c in d["review_body"].lower() if not c in punctuation])
            for w in r.split():
                w = stemmer.stem(w)
                wordCountStem[w] += 1
        return wordCountStem
In [21]:
wordCount = word_ct(sampleSet, 'reg')
In [22]:
wordCountStem = word_ct(sampleSet, 'stem')
In [23]:
print("#1. Number of unique words without stemming: ", len(wordCount))
print("#2. Number of unique words with stemming: ", len(wordCountStem))
#1. Number of unique words without stemming:  125615
#2. Number of unique words with stemming:  106541

Evaluating Classifiers

  1. Define the X vector (the bag-of-words feature vector for each review).
  2. Fit a Ridge regression model with (alpha = 1.0, fit_intercept = True).
  3. Make predictions with the fitted model.
  4. Find the MSE between the predictions and the y vector of star ratings.
In [24]:
def feature_reg(datum):
    # bag-of-words feature vector: counts of the 1000 most common words,
    # using the words/wordId/wordSet structures built in the next cell
    feat = [0]*len(words)
    review = datum['review_body'].lower()
    r = ''.join([c for c in review if not c in punctuation])
    for w in r.split():
        if w in wordSet:
            feat[wordId[w]] += 1
    return feat

def MSE(predictions, labels):
    # mean squared error between two equal-length sequences
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)
In [25]:
counts = [(wordCount[w], w) for w in wordCount]
counts.sort()
counts.reverse()

#Note: increasing the size of the dictionary may require a lot of memory
words = [x[1] for x in counts[:1000]]

wordId = dict(zip(words, range(len(words))))
wordSet = set(words)
In [26]:
X_train = [feature_reg(d) for d in sampleSet]
y_train = [d["star_rating"] for d in sampleSet] # the y_reg vector

X_test = [feature_reg(d) for d in testSet]
y_test = [d["star_rating"] for d in testSet]

rg_model = linear_model.Ridge(alpha = 1.0, fit_intercept = True) 
rg_model.fit(X_train,y_train)

# Predicting the star_rating for testSet based on X_test, which are the feature variables
X_pred = rg_model.predict(X_test)
In [27]:
# Below is the Logistic Regression Model
lg_model = linear_model.LogisticRegression() 
lg_model.fit(X_train,y_train)
lg_X_pred = lg_model.predict(X_test)
In [28]:
print('MSE score of Logistic Regression Model: ', MSE(lg_X_pred, y_test))
print('MSE score of Ridge Regression Model: ', MSE(X_pred, y_test))
MSE score of Logistic Regression Model:  3.3180948873053877
MSE score of Ridge Regression Model:  2.114467190871583

Here the Ridge regression model performs better, giving a lower MSE. This is expected: the logistic model treats the star ratings as unordered classes and predicts a discrete label, so its misclassifications are penalized heavily by squared error, whereas ridge regression minimizes squared error directly.

Task 4: Recommendation Systems

Using a simple latent factor-based recommender system to make rating predictions, then evaluating the performance of those predictions.
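
The model fit below is the bias-only latent factor model: each rating is predicted as

    rating(u, i) = alpha + beta_u + beta_i

where alpha is a global offset and beta_u, beta_i are per-user and per-item bias terms. The parameters are chosen to minimize the MSE of these predictions plus an L2 penalty lamb * (sum of beta_u^2 + sum of beta_i^2), which is what the cost and derivative functions below implement.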

Starting up

Back to using the trainingSet.

In [29]:
#Create and fill our default dictionaries for our dataset
reviewsPerUser = defaultdict(list)
reviewsPerItem = defaultdict(list)

for d in trainingSet:
    user,item = d['customer_id'], d['product_id']
    reviewsPerUser[user].append(d)
    reviewsPerItem[item].append(d)
    
#Create two dictionaries that will be filled with our rating prediction values
userBiases = defaultdict(float)
itemBiases = defaultdict(float)

#Getting the respective lengths of our dataset and dictionaries
N = len(trainingSet)
nUsers = len(reviewsPerUser)
nItems = len(reviewsPerItem)

#Getting the list of keys
users = list(reviewsPerUser.keys())
items = list(reviewsPerItem.keys())

# the list of actual star ratings, used as labels below
y_rec = [d['star_rating'] for d in trainingSet]

Calculate the ratingMean

  1. Find the average rating of the training set.
  2. Calculate a baseline MSE by comparing the actual ratings to that average.
In [30]:
#1 constant baseline: predict the average rating for every review
avg_mean = average_rating(trainingSet)
ratingMean = [avg_mean for d in trainingSet]
In [31]:
#2
MSE(ratingMean, y_rec)
Out[31]:
2.1180373710459253

Here we define the prediction function, plus the regularized cost and its gradient, which the optimizer below uses to minimize the MSE.

In [32]:
alpha = avg_mean

def prediction(user, item):
    # bias-only model: global offset plus per-user and per-item biases
    return alpha + userBiases[user] + itemBiases[item]

def unpack(theta):
    # theta packs [alpha, all user biases, all item biases] into one flat vector
    global alpha
    global userBiases
    global itemBiases
    alpha = theta[0]
    userBiases = dict(zip(users, theta[1:nUsers+1]))
    itemBiases = dict(zip(items, theta[1+nUsers:]))

def cost(theta, labels, lamb):
    # MSE of the predictions plus an L2 penalty on the bias terms
    unpack(theta)
    predictions = [prediction(d['customer_id'], d['product_id']) for d in trainingSet]
    cost = MSE(predictions, labels)
    print("MSE = " + str(cost))
    for u in userBiases:
        cost += lamb*userBiases[u]**2
    for i in itemBiases:
        cost += lamb*itemBiases[i]**2
    return cost

def derivative(theta, labels, lamb):
    # gradient of cost() with respect to alpha and each bias term
    unpack(theta)
    N = len(trainingSet)
    dalpha = 0
    dUserBiases = defaultdict(float)
    dItemBiases = defaultdict(float)
    for d in trainingSet:
        u,i = d['customer_id'], d['product_id']
        pred = prediction(u, i)
        diff = pred - d['star_rating']
        dalpha += 2/N*diff
        dUserBiases[u] += 2/N*diff
        dItemBiases[i] += 2/N*diff
    for u in userBiases:
        dUserBiases[u] += 2*lamb*userBiases[u]
    for i in itemBiases:
        dItemBiases[i] += 2*lamb*itemBiases[i]
    dtheta = [dalpha] + [dUserBiases[u] for u in users] + [dItemBiases[i] for i in items]
    return numpy.array(dtheta)

Optimize

  1. Optimize the MSE using the scipy.optimize.fmin_l_bfgs_b function.
In [33]:
scipy.optimize.fmin_l_bfgs_b(cost, [alpha] + [0.0]*(nUsers+nItems),
                             derivative, args = (y_rec, 0.001))
MSE = 2.1180373710459253
MSE = 2.1040955696488925
MSE = 2.148389521039812
MSE = 2.1018609951134137
MSE = 2.1012283298178764
MSE = 2.0956507055215843
MSE = 2.0791298386455326
MSE = 2.0644987624312896
MSE = 2.040253196391815
MSE = 2.0323517614703546
MSE = 2.029677706194477
MSE = 2.027773193011851
MSE = 2.026633228219697
MSE = 2.0257198686497206
MSE = 2.0255460792805833
MSE = 2.0251435461815506
MSE = 2.025399385964841
MSE = 2.0254074728724083
MSE = 2.025439357517924
MSE = 2.025464266218726
MSE = 2.0254327882942498
MSE = 2.025356745997963
MSE = 2.025324403901929
MSE = 2.0253059439706753
MSE = 2.025291602910186
MSE = 2.0252690137577036
MSE = 2.0252524621992216
MSE = 2.0252510076280106
MSE = 2.0252518909449444
MSE = 2.025255017455529
MSE = 2.025261140378528
MSE = 2.025263681281104
MSE = 2.0252665529961673
MSE = 2.025269503954881
Out[33]:
(array([ 3.84079391e+00,  1.97552611e-03, -4.77640961e-03, ...,
         2.80354677e-04,  2.04331870e-03, -3.24557337e-03]),
 2.0586857848103164,
 {'grad': array([-4.42445919e-06, -7.95861317e-10,  2.98265832e-11, ...,
         -1.15460918e-09, -4.37358172e-09,  5.28333592e-09]),
  'task': b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL',
  'funcalls': 34,
  'nit': 30,
  'warnflag': 0})

Notice the optimized MSE converges to roughly 2.02527, an improvement over the 2.118 baseline MSE of always predicting the mean rating.

Fill Dictionaries

In [34]:
usersPerItem = defaultdict(set)
itemsPerUser = defaultdict(set)

itemTitle = {}
for d in trainingSet:
        user, item = d['customer_id'], d['product_id']
        usersPerItem[item].add(user)
        itemsPerUser[user].add(item)
        itemTitle[item] = d['product_title']

Jaccard Similarity Measure

In [35]:
def Jaccard(s1, s2):
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    return numer / denom

def mostSimilar(iD, m): # iD is the query item id
    similarities = []   # m is the number of similar items to return
    users = usersPerItem[iD]
    for i2 in usersPerItem:
        if i2 == iD: continue
        sim = Jaccard(users, usersPerItem[i2])
        similarities.append((sim,i2))
    similarities.sort(reverse=True)
    return similarities[:m]
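
As a quick worked example with hypothetical user sets (not drawn from the dataset): two items bought by users {A, B, C} and {B, C, D} share 2 users out of 4 distinct users, giving a Jaccard similarity of 2/4 = 0.5.

# hypothetical user sets, just to illustrate the measure
s1 = {'A', 'B', 'C'}
s2 = {'B', 'C', 'D'}
print(Jaccard(s1, s2)) # 2 shared / 4 distinct = 0.5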

Getting Recommendations

In [40]:
query = trainingSet[10]['product_id']
print("Item id: ", query)
print("Corresponding item name: ", itemTitle[query])
Item id:  B00BGGDVOO
Corresponding item name:  Roku 3 Streaming Media Player
In [41]:
# showing the top 10 most similar item IDs
# and their similarity scores in descending order
mostSimilar(query,10)
Out[41]:
[(0.007374912581855172, 'B00INNP5VU'),
 (0.00636846855882058, 'B00DR0PDNE'),
 (0.005158646032082538, 'B005CLPP84'),
 (0.004452817263230005, 'B008I64126'),
 (0.004301075268817204, 'B007I5JT4S'),
 (0.004259850905218318, 'B00F5NB7MW'),
 (0.0038022813688212928, 'B005CLPP8E'),
 (0.002748511223087494, 'B008R7EVE4'),
 (0.0027147273961239727, 'B007KEZMX4'),
 (0.0023286501591244274, 'B00F5NB7JK')]
In [46]:
# Below gives the names of these item IDs
[(x[1], itemTitle[x[1]]) for x in mostSimilar(query,10)]
Out[46]:
[('B00INNP5VU', 'Roku Streaming Stick (3500R) (2014 Model)'),
 ('B00DR0PDNE', 'Google Chromecast HDMI Streaming Media Player'),
 ('B005CLPP84', 'Roku 2 XS 1080p Streaming Player (Old Model)'),
 ('B008I64126', 'SquareTrade 2-Year Home AV Protection Plan ($75-100)'),
 ('B007I5JT4S', 'Apple TV MD199LL/A (Current Version)'),
 ('B00F5NB7MW', 'Roku 2 Streaming Player with Headphone Jack'),
 ('B005CLPP8E', 'Roku 2 XD Streaming Player 1080p (Old Version)'),
 ('B008R7EVE4', 'Roku LT Streaming Player (Old Version)'),
 ('B007KEZMX4', 'Roku HD Streaming Player (Old Model)'),
 ('B00F5NB7JK', 'Roku 1 Streaming Media Player (2710R)')]