Big Mart Sales Prediction

Author: Zhanglin Liu

Date: 10/03/2020

Background

The dataset consists of 2013 Big Mart sales data for 1559 products across 10 stores in different cities. The goal of this project is to predict the sales of each product at a particular outlet using linear regression.

Data Dictionary

  • Item_Identifier: Unique product ID
  • Item_Weight: Weight of product
  • Item_Fat_Content: Whether the product is low fat or not
  • Item_Visibility: The % of total display area of all products in a store allocated to the particular product
  • Item_Type: The category to which the product belongs
  • Item_MRP: Maximum Retail Price (list price) of the product
  • Outlet_Identifier: Unique store ID
  • Outlet_Establishment_Year: The year in which store was established
  • Outlet_Size: The size of the store in terms of ground area covered
  • Outlet_Location_Type: The type of city in which the store is located
  • Outlet_Type: Whether the outlet is just a grocery store or some sort of supermarket
  • Item_Outlet_Sales: Sales of the product in the particular store. This is our Target variable.

Loading Dataset

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
df_train = pd.read_csv('train_bigmartsales.csv')
df_test = pd.read_csv('test_bigmartsales.csv')
df_train.head()
Out[1]:
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type Item_Outlet_Sales
0 FDA15 9.30 Low Fat 0.016047 Dairy 249.8092 OUT049 1999 Medium Tier 1 Supermarket Type1 3735.1380
1 DRC01 5.92 Regular 0.019278 Soft Drinks 48.2692 OUT018 2009 Medium Tier 3 Supermarket Type2 443.4228
2 FDN15 17.50 Low Fat 0.016760 Meat 141.6180 OUT049 1999 Medium Tier 1 Supermarket Type1 2097.2700
3 FDX07 19.20 Regular 0.000000 Fruits and Vegetables 182.0950 OUT010 1998 NaN Tier 3 Grocery Store 732.3800
4 NCD19 8.93 Low Fat 0.000000 Household 53.8614 OUT013 1987 High Tier 3 Supermarket Type1 994.7052

Observation

  • Looking at the dataset carefully, we notice that item types such as Dairy, Meat, and Fruits and Vegetables have an Item_Identifier starting with "FD", drinks have an Item_Identifier starting with "DR", and household (non-consumable) types have an Item_Identifier starting with "NC"
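A quick way to verify this pattern (a minimal sketch; assumes df_train is loaded as above):

# cross-tabulate the 2-letter identifier prefix against Item_Type
prefix = df_train['Item_Identifier'].str[:2]
pd.crosstab(prefix, df_train['Item_Type'])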
In [2]:
# Combine training and test datasets so we don't need to repeat steps when testing our prediction model
df_train['source'] = 'train'
df_test['source'] = 'test'
df = pd.concat([df_train, df_test], ignore_index = True)
In [3]:
df_train.shape, df_test.shape, df.shape
Out[3]:
((8523, 13), (5681, 12), (14204, 13))
In [4]:
# expecting 5681 missing values on variable Item_Outlet_Sales
# these missing values will be ignored since they are from test dataset
# and we are trying to predict these values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14204 entries, 0 to 14203
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            14204 non-null  object 
 1   Item_Weight                11765 non-null  float64
 2   Item_Fat_Content           14204 non-null  object 
 3   Item_Visibility            14204 non-null  float64
 4   Item_Type                  14204 non-null  object 
 5   Item_MRP                   14204 non-null  float64
 6   Outlet_Identifier          14204 non-null  object 
 7   Outlet_Establishment_Year  14204 non-null  int64  
 8   Outlet_Size                10188 non-null  object 
 9   Outlet_Location_Type       14204 non-null  object 
 10  Outlet_Type                14204 non-null  object 
 11  Item_Outlet_Sales          8523 non-null   float64
 12  source                     14204 non-null  object 
dtypes: float64(4), int64(1), object(8)
memory usage: 1.4+ MB

Observations

  • Categorical Variables:
    • Item_Identifier
    • Item_Fat_Content
    • Item_Type
    • Outlet_Identifier
    • Outlet_Size
    • Outlet_Location_Type
    • Outlet_Type
  • Continuous Variables:
    • Item_Weight
    • Item_Visibility
    • Item_MRP
    • Outlet_Establishment_Year
    • Item_Outlet_Sales
  • Variables with missing values:
    • Item_Weight: missing 17.2% of data (2439)
    • Outlet_Size: missing 28.3% of data (4016)
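These figures can be reproduced directly (a minimal sketch on the combined dataframe; Item_Outlet_Sales will also show missing values, which come from the test rows):

# missing-value count and percentage per column
missing = df.isnull().sum()
pd.concat([missing, (100 * missing / len(df)).round(1)], axis = 1, keys = ['count', 'percent'])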
In [5]:
# note: len(x.unique()) counts NaN as a value, which is why Outlet_Size shows 4
df.apply(lambda x: len(x.unique()))
Out[5]:
Item_Identifier               1559
Item_Weight                    416
Item_Fat_Content                 5
Item_Visibility              13006
Item_Type                       16
Item_MRP                      8052
Outlet_Identifier               10
Outlet_Establishment_Year        9
Outlet_Size                      4
Outlet_Location_Type             3
Outlet_Type                      4
Item_Outlet_Sales             3494
source                           2
dtype: int64

Some Observations

  • The above output shows there are 1559 unique products
  • 5 unique Item_Fat_Content labels (before cleanup)
  • 16 unique Item types
  • 10 unique outlets

Data Exploration and Data Cleaning

Univariate Analysis

Categorical Variables

Frequency table of Item_Type

In [6]:
df.Item_Type.value_counts()
Out[6]:
Fruits and Vegetables    2013
Snack Foods              1989
Household                1548
Frozen Foods             1426
Dairy                    1136
Baking Goods             1086
Canned                   1084
Health and Hygiene        858
Meat                      736
Soft Drinks               726
Breads                    416
Hard Drinks               362
Others                    280
Starchy Foods             269
Breakfast                 186
Seafood                    89
Name: Item_Type, dtype: int64
In [7]:
fig, ax = plt.subplots(figsize = (7,5))
df['Item_Type'].value_counts().plot(ax = ax, kind = 'barh')
Out[7]:
<AxesSubplot:>

Observations

  • Some categories have substantial counts while others are sparse
    • Binning some categories together might be helpful
    • Since we have already identified the relationship between Item_Identifier and Item_Type, we can use it to re-categorize the types
In [8]:
# Recategorizing Item_Types
df['Item_Type_cat'] = df['Item_Identifier'].apply(lambda x: x[0:2])
df['Item_Type_cat'] = df['Item_Type_cat'].replace(['FD','DR','NC'],['Food','Drinks','Non-Consumable'])
df.Item_Type_cat.value_counts()
Out[8]:
Food              10201
Non-Consumable     2686
Drinks             1317
Name: Item_Type_cat, dtype: int64
In [9]:
fig, ax = plt.subplots(figsize = (8,5))
df['Item_Type_cat'].value_counts().plot(ax = ax, kind = 'bar', rot = 0)
Out[9]:
<AxesSubplot:>

Frequency table of Item_Fat_Content

In [10]:
df.Item_Fat_Content.value_counts()
Out[10]:
Low Fat    8485
Regular    4824
LF          522
reg         195
low fat     178
Name: Item_Fat_Content, dtype: int64
In [11]:
fig, ax = plt.subplots(figsize = (7,5))
df['Item_Fat_Content'].value_counts().plot(ax = ax, kind = 'bar', rot = 0)
Out[11]:
<AxesSubplot:>

Observations

  • Although the table above shows 5 labels, they can be combined into 3 meaningful categories:
    • Low Fat
    • Regular
    • Non-Consumable (household and health-and-hygiene items have no fat content)
In [12]:
# cleaning up the Item_Fat_Content labels
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(['LF','low fat'],'Low Fat')
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(['reg'],'Regular')
df.loc[df['Item_Type_cat']=='Non-Consumable','Item_Fat_Content'] = 'Non-Consumable'
df.Item_Fat_Content.value_counts()
Out[12]:
Low Fat           6499
Regular           5019
Non-Consumable    2686
Name: Item_Fat_Content, dtype: int64
In [13]:
fig, ax = plt.subplots(figsize = (8,5))
df['Item_Fat_Content'].value_counts().plot(ax = ax, kind = 'bar', rot = 0)
Out[13]:
<AxesSubplot:>

Frequency table of Outlet_Location_Type

In [14]:
df.Outlet_Location_Type.value_counts().sort_index()
Out[14]:
Tier 1    3980
Tier 2    4641
Tier 3    5583
Name: Outlet_Location_Type, dtype: int64
In [15]:
fig, ax = plt.subplots(figsize = (8,5))
df['Outlet_Location_Type'].value_counts().sort_index().plot(ax = ax, kind = 'bar', rot = 0)
Out[15]:
<AxesSubplot:>
In [16]:
# frequency table of Outlet_Type
df.Outlet_Type.value_counts().sort_index()
Out[16]:
Grocery Store        1805
Supermarket Type1    9294
Supermarket Type2    1546
Supermarket Type3    1559
Name: Outlet_Type, dtype: int64
In [17]:
fig, ax = plt.subplots(figsize = (8,5))
df['Outlet_Type'].value_counts().plot(ax = ax, kind = 'pie')
Out[17]:
<AxesSubplot:ylabel='Outlet_Type'>
In [18]:
# frequency table of Outlet_Size
# expected missing values here
df.Outlet_Size.value_counts().sort_index()
Out[18]:
High      1553
Medium    4655
Small     3980
Name: Outlet_Size, dtype: int64
In [19]:
fig, ax = plt.subplots(figsize = (8,5))
df['Outlet_Size'].value_counts().sort_index().plot(ax = ax, kind = 'bar', rot = 0)
# notice there are relatively more Medium-sized outlets than outlets of other sizes
Out[19]:
<AxesSubplot:>
In [20]:
# print out the total number of missing values before imputation
sum(df['Outlet_Size'].isnull())
Out[20]:
4016
Imputation on Outlet_Size
In [21]:
imr = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')
imr = imr.fit(df[['Outlet_Size']])
df['Outlet_Size'] = imr.transform(df[['Outlet_Size']]).ravel()

# print total number of missing values after imputation
sum(df['Outlet_Size'].isnull())
Out[21]:
0
In [22]:
df.Outlet_Size.value_counts().sort_index()
Out[22]:
High      1553
Medium    8671
Small     3980
Name: Outlet_Size, dtype: int64
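Note that mode imputation assigns every missing Outlet_Size to Medium, inflating that category from 4655 to 8671. A finer-grained alternative (a sketch only, not used in this notebook) imputes the most common size within each Outlet_Type; it would need to run before the global imputation above, while the NaNs are still present:

# hypothetical alternative: per-outlet-type mode instead of the global mode
mode_by_type = df.groupby('Outlet_Type')['Outlet_Size'].transform(lambda s: s.mode().iat[0])
df['Outlet_Size'] = df['Outlet_Size'].fillna(mode_by_type)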

Continuous Variables

In [23]:
df.describe()
Out[23]:
Item_Weight Item_Visibility Item_MRP Outlet_Establishment_Year Item_Outlet_Sales
count 11765.000000 14204.000000 14204.000000 14204.000000 8523.000000
mean 12.792854 0.065953 141.004977 1997.830681 2181.288914
std 4.652502 0.051459 62.086938 8.371664 1706.499616
min 4.555000 0.000000 31.290000 1985.000000 33.290000
25% 8.710000 0.027036 94.012000 1987.000000 834.247400
50% 12.600000 0.054021 142.247000 1999.000000 1794.331000
75% 16.750000 0.094037 185.855600 2004.000000 3101.296400
max 21.350000 0.328391 266.888400 2009.000000 13086.964800

Observations

  • There appear to be outliers in Item_Visibility and Item_Outlet_Sales
  • Item_Weight has missing values
  • A value of 0 in Item_Visibility does not make practical sense
In [24]:
df.hist(bins = 20, color = 'steelblue', edgecolor = 'black', linewidth = 1.0,
       xlabelsize = 8, ylabelsize = 8, grid = False)
plt.tight_layout(rect = (0,0,1.2,1.2))
Imputation on Item_Weight
In [25]:
imr_wt = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imr_wt = imr_wt.fit(df[['Item_Weight']])
df['Item_Weight'] = imr_wt.transform(df[['Item_Weight']]).ravel()

# print total number of missing values after imputation
sum(df['Item_Weight'].isnull())
Out[25]:
0
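Each Item_Identifier refers to a single product, so its weight should be constant across outlets. An alternative to the global mean (a sketch only, to be run before the imputation above while the NaNs are still present) fills each item's missing weight with that item's own mean:

# hypothetical alternative: per-item mean weight with a global-mean fallback
item_mean = df.groupby('Item_Identifier')['Item_Weight'].transform('mean')
df['Item_Weight'] = df['Item_Weight'].fillna(item_mean)
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())  # items never weighed anywhere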

Item_Visibility

In [26]:
# indeed there exist quite a few outliers in the variable Item_Visibility
sns.boxplot(df['Item_Visibility'])
sns.despine()
A Little Cleaning on Item_Visibility

Although the box plot above shows outliers in this variable, we will keep them, since larger items can legitimately take up more display area. However, as noted before, a display-area percentage of 0% makes no practical sense, so we replace all zeros using mean imputation.

In [27]:
# count the zero-visibility entries before imputation
(df['Item_Visibility'] == 0).sum()
Out[27]:
879
In [28]:
# treat 0 as missing: SimpleImputer computes the mean over the non-zero entries
imr_vs = SimpleImputer(missing_values = 0, strategy = 'mean')
imr_vs = imr_vs.fit(df[['Item_Visibility']])
df['Item_Visibility'] = imr_vs.transform(df[['Item_Visibility']]).ravel()

# count the zero-visibility entries after imputation
(df['Item_Visibility'] == 0).sum()
Out[28]:
0

Item_Outlet_Sales

In [29]:
sns.boxplot(df['Item_Outlet_Sales'])
sns.despine()
In [30]:
sns.distplot(df['Item_Outlet_Sales'])
# distplot is deprecated in recent seaborn; histplot(..., kde = True) is the modern equivalent
# we don't perform any outlier treatment on the target variable in this project
Out[30]:
<AxesSubplot:xlabel='Item_Outlet_Sales'>

Bivariate Analysis

In [31]:
sns.boxplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales', data = df)
Out[31]:
<AxesSubplot:xlabel='Item_Fat_Content', ylabel='Item_Outlet_Sales'>
In [32]:
# since we already grouped items into 3 categories
sns.boxplot(x = 'Item_Type_cat', y = 'Item_Outlet_Sales', data = df)
Out[32]:
<AxesSubplot:xlabel='Item_Type_cat', ylabel='Item_Outlet_Sales'>
In [33]:
sns.boxplot(x = 'Outlet_Size', y = 'Item_Outlet_Sales', data = df,
           order =['Small','Medium','High'])
Out[33]:
<AxesSubplot:xlabel='Outlet_Size', ylabel='Item_Outlet_Sales'>
In [34]:
sns.boxplot(x = 'Outlet_Location_Type', y = 'Item_Outlet_Sales',data = df,
           order =['Tier 1','Tier 2','Tier 3'])
Out[34]:
<AxesSubplot:xlabel='Outlet_Location_Type', ylabel='Item_Outlet_Sales'>
In [35]:
sns.barplot(x = 'Outlet_Establishment_Year', y = 'Item_Outlet_Sales', data = df)
# we should transform 'Outlet_Establishment_Year' (e.g., into outlet age) to get more insight
Out[35]:
<AxesSubplot:xlabel='Outlet_Establishment_Year', ylabel='Item_Outlet_Sales'>
In [36]:
sns.scatterplot(x = 'Item_Weight', y = 'Item_Outlet_Sales', data = df)
Out[36]:
<AxesSubplot:xlabel='Item_Weight', ylabel='Item_Outlet_Sales'>
In [37]:
sns.scatterplot(x = 'Item_Visibility', y = 'Item_Outlet_Sales',data = df)
Out[37]:
<AxesSubplot:xlabel='Item_Visibility', ylabel='Item_Outlet_Sales'>
In [38]:
sns.scatterplot(x = 'Item_MRP', y = 'Item_Outlet_Sales', data = df)
Out[38]:
<AxesSubplot:xlabel='Item_MRP', ylabel='Item_Outlet_Sales'>

Feature Engineering

In [39]:
# label encoding: map each categorical string to an integer code
# (this is ordinal encoding, not one-hot encoding)
code_numeric = {'Low Fat': 1, 'Regular': 2, 'Non-Consumable': 3,
                'Food': 1, 'Drinks': 2,
                'Small': 1, 'Medium': 2, 'High': 3,
                'Supermarket Type1': 1, 'Supermarket Type2': 2, 'Supermarket Type3': 3, 'Grocery Store': 4,
                'Tier 1': 1, 'Tier 2': 2, 'Tier 3': 3
               }
df = df.applymap(lambda i: code_numeric.get(i) if i in code_numeric else i)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14204 entries, 0 to 14203
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            14204 non-null  object 
 1   Item_Weight                14204 non-null  float64
 2   Item_Fat_Content           14204 non-null  int64  
 3   Item_Visibility            14204 non-null  float64
 4   Item_Type                  14204 non-null  object 
 5   Item_MRP                   14204 non-null  float64
 6   Outlet_Identifier          14204 non-null  object 
 7   Outlet_Establishment_Year  14204 non-null  int64  
 8   Outlet_Size                14204 non-null  int64  
 9   Outlet_Location_Type       14204 non-null  int64  
 10  Outlet_Type                14204 non-null  int64  
 11  Item_Outlet_Sales          8523 non-null   float64
 12  source                     14204 non-null  object 
 13  Item_Type_cat              14204 non-null  int64  
dtypes: float64(4), int64(6), object(4)
memory usage: 1.5+ MB
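Integer codes impose an artificial ordering on nominal categories (e.g., Grocery Store = 4 is not "greater than" Supermarket Type1 = 1), which a linear model will treat as meaningful. A true one-hot encoding is often safer; a minimal sketch of that alternative, applied to the columns named in this notebook:

# hypothetical alternative: one-hot encode the nominal columns
df_ohe = pd.get_dummies(df, columns = ['Item_Fat_Content', 'Item_Type_cat', 'Outlet_Size',
                                       'Outlet_Location_Type', 'Outlet_Type'])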
In [40]:
# convert establishment year to outlet age (years in operation as of 2013)
df['Outlet_Establishment_Year'] = 2013 - df['Outlet_Establishment_Year']
df['Outlet_Establishment_Year'].value_counts()
Out[40]:
28    2439
26    1553
14    1550
9     1550
16    1550
11    1548
4     1546
6     1543
15     925
Name: Outlet_Establishment_Year, dtype: int64

Exporting datasets

In [41]:
df
Out[41]:
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Identifier Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type Item_Outlet_Sales source Item_Type_cat
0 FDA15 9.30 1 0.016047 Dairy 249.8092 OUT049 14 2 1 1 3735.1380 train 1
1 DRC01 5.92 2 0.019278 Soft Drinks 48.2692 OUT018 4 2 3 2 443.4228 train 2
2 FDN15 17.50 1 0.016760 Meat 141.6180 OUT049 14 2 1 1 2097.2700 train 1
3 FDX07 19.20 2 0.070303 Fruits and Vegetables 182.0950 OUT010 15 2 3 4 732.3800 train 1
4 NCD19 8.93 3 0.070303 Household 53.8614 OUT013 26 3 3 1 994.7052 train 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14199 FDB58 10.50 2 0.013496 Snack Foods 141.3154 OUT046 16 1 1 1 NaN test 1
14200 FDD47 7.60 2 0.142991 Starchy Foods 169.1448 OUT018 4 2 3 2 NaN test 1
14201 NCO17 10.00 3 0.073529 Health and Hygiene 118.7440 OUT045 11 2 2 1 NaN test 3
14202 FDJ26 15.30 2 0.070303 Canned 214.6218 OUT017 6 2 2 1 NaN test 1
14203 FDU37 9.50 2 0.104720 Canned 79.7960 OUT045 11 2 2 1 NaN test 1

14204 rows × 14 columns

In [42]:
df.columns
Out[42]:
Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
       'Item_Type', 'Item_MRP', 'Outlet_Identifier',
       'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
       'Outlet_Type', 'Item_Outlet_Sales', 'source', 'Item_Type_cat'],
      dtype='object')
In [43]:
IDcols = ['Item_Identifier', 'Outlet_Identifier']
ids = df_test[IDcols].copy()   # explicit copy so the later column assignment doesn't warn
ids
Out[43]:
Item_Identifier Outlet_Identifier
0 FDW58 OUT049
1 FDW14 OUT017
2 NCN55 OUT010
3 FDQ58 OUT017
4 FDY38 OUT027
... ... ...
5676 FDB58 OUT046
5677 FDD47 OUT018
5678 NCO17 OUT045
5679 FDJ26 OUT017
5680 FDU37 OUT045

5681 rows × 2 columns

In [44]:
df.drop(['Item_Type'], axis = 1, inplace = True)
df.drop(IDcols, axis = 1, inplace = True)

# take explicit copies so the drops below don't raise chained-assignment warnings
df_train = df.loc[df['source'] == 'train'].copy()
df_test = df.loc[df['source'] == 'test'].copy()
df_train.drop(['source'], axis = 1, inplace = True)
df_test.drop(['source'], axis = 1, inplace = True)
In [45]:
df_train.head()
Out[45]:
Item_Weight Item_Fat_Content Item_Visibility Item_MRP Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type Item_Outlet_Sales Item_Type_cat
0 9.30 1 0.016047 249.8092 14 2 1 1 3735.1380 1
1 5.92 2 0.019278 48.2692 4 2 3 2 443.4228 2
2 17.50 1 0.016760 141.6180 14 2 1 1 2097.2700 1
3 19.20 2 0.070303 182.0950 15 2 3 4 732.3800 1
4 8.93 3 0.070303 53.8614 26 3 3 1 994.7052 3

Variable Correlations

In [46]:
fig, ax = plt.subplots(figsize = (10,10))
sns.heatmap(df_train.corr(), annot = True, linewidths = .5, ax = ax)
Out[46]:
<AxesSubplot:>

Predictive Modeling

Linear Regression

In [47]:
# defining relevant variables
y_train = df_train['Item_Outlet_Sales']
y_test = df_test['Item_Outlet_Sales']   # all NaN: the test set has no sales labels

X_train = df_train.drop(['Item_Outlet_Sales'], axis = 1)
X_test = df_test.drop(['Item_Outlet_Sales'], axis = 1)
In [48]:
from sklearn.linear_model import LinearRegression

# the normalize argument was removed in scikit-learn 1.2; for plain OLS it only
# rescaled the features internally and did not change the predictions
model = LinearRegression()
model.fit(X_train, y_train)
model.intercept_, model.coef_   # fitted parameters (not displayed: not the last expression)
y_predict = model.predict(X_test)
y_predict
Out[48]:
array([1412.99658034, 1311.07132568, 2641.21370853, ..., 1862.95575275,
       3195.19446469, 1288.25108081])
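Since the test labels are unavailable, out-of-sample error can only be estimated on the training data. A minimal sketch using 5-fold cross-validated RMSE (assumes X_train and y_train as defined above; not part of the original run):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated RMSE on the training data
scores = cross_val_score(model, X_train, y_train, scoring = 'neg_root_mean_squared_error', cv = 5)
print('CV RMSE: %.1f +/- %.1f' % (-scores.mean(), scores.std()))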

Sales Prediction

In [49]:
ids['Item_Outlet_Sales'] = y_predict
prediction = ids   # ids is already a DataFrame; no conversion needed
prediction.to_csv('submission.csv', index = False)
In [50]:
prediction
Out[50]:
Item_Identifier Outlet_Identifier Item_Outlet_Sales
0 FDW58 OUT049 1412.996580
1 FDW14 OUT017 1311.071326
2 NCN55 OUT010 2641.213709
3 FDQ58 OUT017 2377.367238
4 FDY38 OUT027 3831.932165
... ... ... ...
5676 FDB58 OUT046 2577.033091
5677 FDD47 OUT018 2242.118587
5678 NCO17 OUT045 1862.955753
5679 FDJ26 OUT017 3195.194465
5680 FDU37 OUT045 1288.251081

5681 rows × 3 columns

Conclusion

Taking a look at the result in "submission.csv", 110 of the 5681 predictions are negative, while the remaining 5571 are positive. Negative predictions are expected from an unconstrained linear model; taking a natural log of Item_MRP (or of the target itself) during the analysis could help. However, the underlying relationship is not strictly linear, and this project is only for practice purposes.
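A minimal sketch of one such remedy (assuming X_train, y_train, and X_test from above are still in scope): fit on the log-transformed target and exponentiate the predictions back, a variant of the log idea above applied to the target rather than to Item_MRP.

# fit on log1p(sales); expm1 maps predictions back to the original scale and
# stays positive as long as the predicted log value is above 0 (log1p(sales)
# ranges from roughly 3.5 to 9.5 here, so that holds comfortably)
log_model = LinearRegression()
log_model.fit(X_train, np.log1p(y_train))
y_predict_log = np.expm1(log_model.predict(X_test))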