Exploratory Data Analysis on Loan Training Dataset

Author: Zhanglin Liu

Date: 09/27/2020

Background

This project focuses on Exploratory Data Analysis (EDA): data exploration, data preparation, and data cleaning. The dataset of interest is the training set of the Loan Prediction Problem Dataset. The original problem is to predict customers' loan eligibility from their application details; this project covers neither the loan prediction step nor the machine learning algorithms.

Data Dictionary:

  • Loan_ID: Unique Loan ID
  • Gender: Male/Female
  • Married: Applicant married Y/N
  • Dependents: Number of dependents
  • Education: Graduate/Not Graduate
  • Self_Employed: Y/N
  • ApplicantIncome: Applicant Income
  • CoapplicantIncome: Coapplicant Income
  • LoanAmount: Loan amount in thousands
  • Loan_Amount_Term: Term of loan in months
  • Credit_History: 1 for meeting the guidelines, 0 for not meeting the guidelines
  • Property_Area: Urban/Semi Urban/Rural
  • Loan_Status: Loan approved Y/N

Loading Dataset

In [1]:
import pandas as pd
import seaborn as sns
from scipy.stats import chi2_contingency
from statsmodels.stats import weightstats as stests
import matplotlib.pyplot as plt
df = pd.read_csv("C:/Users/Elle/Desktop/jupyter_folder/datasets/loan_train.csv")
df.head()
Out[1]:
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
0 LP001002 Male No 0 Graduate No 5849 0.0 NaN 360.0 1.0 Urban Y
1 LP001003 Male Yes 1 Graduate No 4583 1508.0 128.0 360.0 1.0 Rural N
2 LP001005 Male Yes 0 Graduate Yes 3000 0.0 66.0 360.0 1.0 Urban Y
3 LP001006 Male Yes 0 Not Graduate No 2583 2358.0 120.0 360.0 1.0 Urban Y
4 LP001008 Male No 0 Graduate No 6000 0.0 141.0 360.0 1.0 Urban Y

Data Exploration and Preparation

Variable Identification

Although we are not going to solve the original Loan Prediction problem here, it is generally helpful to keep the original problem in mind while performing the exploratory analysis. The variable types can thus be defined as follows:

(Possible) Predictor Variables:

  • Gender
  • Married
  • Dependents
  • Education
  • Self_Employed
  • ApplicantIncome
  • CoapplicantIncome
  • LoanAmount
  • Loan_Amount_Term
  • Credit_History
  • Property_Area

Target Variable:

  • Loan_Status
In [2]:
# the dataset comprises 614 observations and 13 characteristics
df.shape
Out[2]:
(614, 13)
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB

Section Summary

The dataset's variables can be separated into the following two categories (a quick programmatic cross-check follows the lists):

Categorical:

  • Gender
  • Married
  • Dependents
  • Education
  • Self_Employed
  • Loan_Amount_Term
  • Credit_History
  • Property_Area
  • Loan_Status

Continuous:

  • ApplicantIncome
  • CoapplicantIncome
  • LoanAmount
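As a cross-check of this split, the dtypes reported by df.info() can be used to separate the columns programmatically. A minimal sketch, assuming df is the DataFrame loaded above; note that Loan_Amount_Term and Credit_History are stored as float64 even though we treat them as categorical:

In [ ]:
# sketch: split columns by dtype; object columns (minus the ID) are categorical,
# while Loan_Amount_Term / Credit_History are float64 but treated as categorical
categorical = df.select_dtypes(include='object').columns.drop('Loan_ID').tolist()
categorical += ['Loan_Amount_Term', 'Credit_History']
continuous = [c for c in df.select_dtypes(include='number').columns
              if c not in ('Loan_Amount_Term', 'Credit_History')]
print(categorical)
print(continuous)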

Univariate Analysis

Categorical Variables

In [4]:
freq_tbl1 = df.groupby(['Gender','Married','Education','Self_Employed'])
freq_tbl1.size()
Out[4]:
Gender  Married  Education     Self_Employed
Female  No       Graduate      No                53
                               Yes                8
                 Not Graduate  No                10
                               Yes                3
        Yes      Graduate      No                21
                               Yes                3
                 Not Graduate  No                 4
                               Yes                1
Male    No       Graduate      No                81
                               Yes               12
                 Not Graduate  No                25
                               Yes                4
        Yes      Graduate      No               223
                               Yes               38
                 Not Graduate  No                71
                               Yes                9
dtype: int64
In [5]:
freq_tbl2 = df.groupby(['Credit_History','Property_Area','Loan_Amount_Term'])
freq_tbl2.size()
Out[5]:
Credit_History  Property_Area  Loan_Amount_Term
0.0             Rural          180.0                 4
                               300.0                 1
                               360.0                21
                Semiurban      180.0                 1
                               300.0                 2
                               360.0                25
                               480.0                 1
                Urban          180.0                 5
                               360.0                20
                               480.0                 3
1.0             Rural          84.0                  1
                               180.0                 7
                               300.0                 2
                               360.0               123
                               480.0                 2
                Semiurban      36.0                  2
                               84.0                  1
                               120.0                 2
                               180.0                 9
                               240.0                 2
                               300.0                 4
                               360.0               159
                               480.0                 6
                Urban          60.0                  2
                               84.0                  1
                               120.0                 1
                               180.0                17
                               240.0                 1
                               300.0                 1
                               360.0               122
                               480.0                 2
dtype: int64
In [6]:
df.Loan_Status.value_counts()
Out[6]:
Y    422
N    192
Name: Loan_Status, dtype: int64

There are more loans approved than rejected.
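For reference, the same counts expressed as proportions (a one-line sketch):

In [ ]:
# approval rate as a proportion: roughly 69% of the 614 loans are approved
df.Loan_Status.value_counts(normalize=True)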

Continuous Variables

In [7]:
df[['ApplicantIncome','CoapplicantIncome','LoanAmount']].describe()
Out[7]:
ApplicantIncome CoapplicantIncome LoanAmount
count 614.000000 614.000000 592.000000
mean 5403.459283 1621.245798 146.412162
std 6109.041673 2926.248369 85.587325
min 150.000000 0.000000 9.000000
25% 2877.500000 0.000000 100.000000
50% 3812.500000 1188.500000 128.000000
75% 5795.000000 2297.250000 168.000000
max 81000.000000 41667.000000 700.000000

Observations:

  1. LoanAmount has missing values (its count is 592, versus 614 for the other two variables)
  2. there is a notably large difference between the 75th percentile and the maximum value for all three variables

These two observations suggest that there are missing values and outliers in our dataset; treatments should be applied to avoid building a biased predictive model.
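A direct way to quantify the first observation is to count the missing values per column; a minimal sketch using the df loaded above, which also grounds the percentages reported in the Section Summary below:

In [ ]:
# count and percentage of missing values per column
missing = df.isnull().sum()
pd.DataFrame({'missing': missing,
              'percent': (missing / len(df) * 100).round(1)})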

In [8]:
# indeed, there are quite a few outliers
sns.boxplot(df['ApplicantIncome'])
sns.despine()
In [9]:
IQR_app = df.ApplicantIncome.quantile(0.75) - df.ApplicantIncome.quantile(0.25)
upper_limit_app = df.ApplicantIncome.quantile(0.75) + (IQR_app*1.5)
upper_limit_extreme_app = df.ApplicantIncome.quantile(0.75) + (IQR_app*2)
upper_limit_app, upper_limit_extreme_app
Out[9]:
(10171.25, 11630.0)
In [10]:
outlier_count_app = len(df[(df['ApplicantIncome'] > upper_limit_app)])
outlier_count_app
Out[10]:
50
In [11]:
# the ApplicantIncome distribution is right skewed 
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
ax.hist(df['ApplicantIncome'], bins=30)
plt.title('ApplicantIncome Distribution')
plt.xlabel('Applicant Income')
plt.ylabel('Frequency')
plt.show()
In [12]:
sns.boxplot(df['CoapplicantIncome'])
sns.despine()
In [13]:
IQR_coapp = df.CoapplicantIncome.quantile(0.75) - df.CoapplicantIncome.quantile(0.25)
upper_limit_coapp = df.CoapplicantIncome.quantile(0.75) + (IQR_coapp*1.5)
upper_limit_extreme_coapp = df.CoapplicantIncome.quantile(0.75) + (IQR_coapp*2)
upper_limit_coapp, upper_limit_extreme_coapp
Out[13]:
(5743.125, 6891.75)
In [14]:
outlier_count_coapp = len(df[(df['CoapplicantIncome'] > upper_limit_coapp)])
outlier_count_coapp
Out[14]:
18
In [15]:
sns.boxplot(df['LoanAmount'])
sns.despine()
In [16]:
IQR_loanAmt = df.LoanAmount.quantile(0.75) - df.LoanAmount.quantile(0.25)
upper_limit_loanAmt = df.LoanAmount.quantile(0.75) + (IQR_loanAmt*1.5)
upper_limit_extreme_loanAmt = df.LoanAmount.quantile(0.75) + (IQR_loanAmt*2)
upper_limit_loanAmt, upper_limit_extreme_loanAmt
Out[16]:
(270.0, 304.0)
In [17]:
outlier_count_loanAmt = len(df[(df['LoanAmount'] > upper_limit_loanAmt)])
outlier_count_loanAmt
Out[17]:
39
In [18]:
# the LoanAmount distribution is right skewed 
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
ax.hist(df['LoanAmount'], bins=100)
plt.title('Loan Amount Distribution')
plt.xlabel('Loan Amount')
plt.ylabel('Frequency')
plt.show()

Section Summary

Variables with missing values:

  • Gender (13 or 2.1%)
  • Married (3 or 0.5%)
  • Dependents (15 or 2.4%)
  • Self_Employed (32 or 5.2%)
  • LoanAmount (22 or 3.6%)
  • Loan_Amount_Term (14 or 2.3%)
  • Credit_History (50 or 8.1%)

Variables with outliers:

  • ApplicantIncome (50 observations above the upper limit)
  • CoapplicantIncome (18 observations above the upper limit)
  • LoanAmount (39 observations above the upper limit)

where the "upper limit" is $Q_3 + 1.5 \times IQR$ (the 75th percentile plus 1.5 times the interquartile range).

Bivariate Analysis

Categorical and Categorical

Loan_Status and Gender
In [19]:
tbl = pd.crosstab(index = df['Loan_Status'], columns = df['Gender'])
tbl.index = ['Not Approved','Approved']
tbl
Out[19]:
Gender Female Male
Not Approved 37 150
Approved 75 339

Pearson's Chi-square Test

The chi-square test statistic for a test of independence of two categorical variables is:

$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$

where $O$ is the observed frequency and $E$ is the expected frequency.

$H_0$: There is no relationship between Loan_Status and Gender.
$H_1$: There is a significant relationship between the two.

Calculated chi-square value (original frequency table, missing values excluded)

The calculated $\chi^2$ value is 0.2369751, with 1 degree of freedom. The critical value of the $\chi^2$ distribution with df = 1 at the 0.05 level is 3.841.

Hence:

$$ \chi^2_{critical} \geq \chi^2_{calculated} $$

and we fail to reject $H_0$.
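For transparency, the hand calculation can be reproduced from the observed table, using expected counts E = (row total × column total) / N (a sketch of the uncorrected statistic; note that chi2_contingency below applies Yates' continuity correction by default to 2×2 tables, so its statistic differs slightly):

In [ ]:
import numpy as np
obs = np.array([[37, 150],
                [75, 339]])  # Loan_Status x Gender, missing values excluded
expected = obs.sum(axis=1, keepdims=True) @ obs.sum(axis=0, keepdims=True) / obs.sum()
chi2_by_hand = ((obs - expected) ** 2 / expected).sum()
chi2_by_hand  # ~0.237, below the df = 1 critical value of 3.841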

In [20]:
# performing the test using Python
# frequency table without missing values
stat, p, dof, expected = chi2_contingency(tbl)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
p value is 0.7086529816451106
Independent (H_0 holds true)
In [21]:
# what would the table look like if we included the counts of missing-value observations?
tbl2 = tbl.copy()  # copy so that the original crosstab is not modified
df_new = df[df['Gender'].isnull()]
tbl_tp = df_new.groupby('Loan_Status')
tbl_tp.size()
Out[21]:
Loan_Status
N    5
Y    8
dtype: int64
In [22]:
tbl2['NaN_val'] = [5,8]
tbl2
Out[22]:
Gender Female Male NaN_val
Not Approved 37 150 5
Approved 75 339 8
In [23]:
tbl2.plot.bar(xlabel = 'Loan Status', rot = 0)
Out[23]:
<AxesSubplot:xlabel='Loan Status'>

The above bar plot does not provide much evidence of a significant relationship between Loan_Status and Gender.

Calculated chi-square value (modified frequency table including counts of missing values)

The calculated $\chi^2$ value is approximately 0.556, with 2 degrees of freedom. The critical value of the $\chi^2$ distribution with df = 2 at the 0.05 level is 5.991.

Hence:

$$ \chi^2_{critical} \geq \chi^2_{calculated} $$

and again we fail to reject $H_0$.

In [24]:
# performing the test using Python
# frequency table with missing values
stat, p, dof, expected = chi2_contingency(tbl2)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
p value is 0.7573361980618397
Independent (H_0 holds true)

Both $\chi^2$ tests fail to reject $H_0$, suggesting that there is no significant relationship between Loan_Status and Gender.

Loan_Status and Married
In [25]:
tbl_marrd = pd.crosstab(index = df['Loan_Status'], columns = df['Married'])
tbl_marrd.index = ['Not Approved','Approved']
tbl_marrd.columns = ['Not Married','Married']
tbl_marrd
Out[25]:
Not Married Married
Not Approved 79 113
Approved 134 285

Chi-square Test

$H_0$: There is no relationship between Loan_Status and Married.
$H_1$: Negation of $H_0$.

In [26]:
stat, p, dof, expected = chi2_contingency(tbl_marrd)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
p value is 0.03439381301579988
Dependent (reject H_0)
Loan_Status and Education
In [27]:
tbl_ed = pd.crosstab(df['Loan_Status'],df['Education'])
tbl_ed.index = ['Not Approved','Approved']
tbl_ed
Out[27]:
Education Graduate Not Graduate
Not Approved 140 52
Approved 340 82
In [28]:
tbl_ed.plot.bar(xlabel = 'Loan Status', rot = 0)
Out[28]:
<AxesSubplot:xlabel='Loan Status'>

Chi-square Test

$H_0$: There is no relationship between Loan_Status and Education.
$H_1$: Negation of $H_0$.

In [29]:
stat, p, dof, expected = chi2_contingency(tbl_ed)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
p value is 0.04309962129357355
Dependent (reject H_0)

The above p-value indicates that the relationship between Loan_Status and Education is significant at the 5% level.

Loan_Status and Self_Employed
In [30]:
tbl_emp = pd.crosstab(index = df['Loan_Status'], columns = df['Self_Employed'])
tbl_emp.index = ['Not Approved','Approved']
tbl_emp
Out[30]:
Self_Employed No Yes
Not Approved 157 26
Approved 343 56

Self_Employed has 32 missing values.

In [31]:
tbl_emp.plot.bar(xlabel = 'Loan Status',rot = 0)
Out[31]:
<AxesSubplot:xlabel='Loan Status'>

Chi-square Test

$H_0$: There is no relationship between Loan_Status and Self_Employed.
$H_1$: Negation of $H_0$.

In [32]:
stat, p, dof, expected = chi2_contingency(tbl_emp)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
p value is 0.9420039242223718
Independent (H_0 holds true)

The chi-square test suggests that there is no significant relationship between Loan_Status and Self_Employed.

Loan_Status and Loan_Amount_Term
In [33]:
tbl_term = pd.crosstab(index = df['Loan_Status'], columns = df['Loan_Amount_Term'])
tbl_term.index = ['Not Approved','Approved']
tbl_term
Out[33]:
Loan_Amount_Term 12.0 36.0 60.0 84.0 120.0 180.0 240.0 300.0 360.0 480.0
Not Approved 0 2 0 1 0 15 1 5 153 9
Approved 1 0 2 3 3 29 3 8 359 6
In [34]:
stat, p, dof, expected = chi2_contingency(tbl_term)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
p value is 0.12185762421276597
Independent (H_0 holds true)
Loan_Status and Credit_History
In [35]:
tbl_crt = pd.crosstab(index = df['Loan_Status'], columns = df['Credit_History'])
tbl_crt.index = ['Not Approved','Approved']
tbl_crt.columns = ['Guidelines Not Met', 'Guidelines Met'] 
tbl_crt
Out[35]:
Guidelines Not Met Guidelines Met
Not Approved 82 97
Approved 7 378
  • Credit_History has 50 missing values
  • Relatively more loans appear to be approved when the applicant's credit history meets the guidelines
In [36]:
tbl_crt.plot.bar(xlabel = 'Loan_Status', rot = 0)
Out[36]:
<AxesSubplot:xlabel='Loan_Status'>

Chi-square Test

$H_0$: There is no relationship between Loan_Status and Credit_History.
$H_1$: Negation of $H_0$.

In [37]:
stat, p, dof, expected = chi2_contingency(tbl_crt)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
p value is 7.184759548750746e-40
Dependent (reject H_0)
Loan_Status and Property_Area
In [38]:
tbl_area = pd.crosstab(index = df['Loan_Status'], columns = df['Property_Area'])
tbl_area.index = ['Not Approved','Approved']
tbl_area
Out[38]:
Property_Area Rural Semiurban Urban
Not Approved 69 54 69
Approved 110 179 133
In [39]:
tbl_area.plot.bar(xlabel = 'Loan Status', rot = 0)
Out[39]:
<AxesSubplot:xlabel='Loan Status'>

Chi-square Test

$H_0$: There is no relationship between Loan_Status and Property_Area.
$H_1$: Negation of $H_0$.

In [40]:
stat, p, dof, expected = chi2_contingency(tbl_area)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
p value is 0.0021360187811644915
Dependent (reject H_0)

Categorical and Continuous

Loan_Status and ApplicantIncome
In [41]:
df['ApplicantIncome'].groupby(df['Loan_Status']).describe()
Out[41]:
count mean std min 25% 50% 75% max
Loan_Status
N 192.0 5446.078125 6819.558528 150.0 2885.0 3833.5 5861.25 81000.0
Y 422.0 5384.068720 5765.441615 210.0 2877.5 3812.5 5771.50 63337.0
In [42]:
# subset with Loan_Status = N (not approved)
tbl_appN = df[(df['Loan_Status'] == 'N')]
tbl_appN = tbl_appN['ApplicantIncome']
tbl_appN 
Out[42]:
1       4583
7       3036
9      12841
13      1853
17      3510
       ...  
596     6383
597     2987
600      416
605     2400
613     4583
Name: ApplicantIncome, Length: 192, dtype: int64
In [43]:
# subset with Loan_Status = Y (approved)
tbl_appY = df[(df['Loan_Status'] == 'Y')]
tbl_appY = tbl_appY['ApplicantIncome']
tbl_appY
Out[43]:
0      5849
2      3000
3      2583
4      6000
5      5417
       ... 
608    3232
609    2900
610    4106
611    8072
612    7583
Name: ApplicantIncome, Length: 422, dtype: int64

Two-sample Z-test (ApplicantIncome split by Loan_Status)

$$ H_0: \mu_{N_{income}} = \mu_{Y_{income}}, \qquad H_1: \mu_{N_{income}} \neq \mu_{Y_{income}} $$

We assume the two samples are independent.
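For reference, stests.ztest with its default usevar='pooled' computes the pooled two-sample statistic

$$ z = \frac{\bar{x}_N - \bar{x}_Y}{s_p\sqrt{\frac{1}{n_N} + \frac{1}{n_Y}}}, \qquad s_p^2 = \frac{(n_N - 1)s_N^2 + (n_Y - 1)s_Y^2}{n_N + n_Y - 2} $$

where $\bar{x}$, $s^2$, and $n$ denote each group's sample mean, variance, and size.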

In [44]:
ztest, pval = stests.ztest(tbl_appN,tbl_appY,value = 0, alternative = 'two-sided')
alpha = 0.05
print("p value is " + str(pval))
if pval <= alpha:
    print('Two population means are not equal (reject H_0)')
else:
    print('Two population means are equal (H_0 holds true)')
p value is 0.9072495909325038
Two population means are equal (H_0 holds true)

Conclusion

There is no evidence that a difference in applicant income has an impact on loan eligibility.

In [45]:
fig, ax = plt.subplots(figsize = (6,5))
df.boxplot(column = 'ApplicantIncome', by = 'Loan_Status', ax = ax, grid = False)
plt.show()
Loan_Status and LoanAmount
In [46]:
df['LoanAmount'].groupby(df['Loan_Status']).describe()
Out[46]:
count mean std min 25% 50% 75% max
Loan_Status
N 181.0 151.220994 85.862783 9.0 100.0 129.0 176.0 570.0
Y 411.0 144.294404 85.484607 17.0 100.0 126.0 161.0 700.0
In [47]:
tbl_amtN = df[(df['Loan_Status'] == 'N')]
tbl_amtN = tbl_amtN['LoanAmount'].dropna() # we know there are missing values
tbl_amtN 
Out[47]:
1      128.0
7      158.0
9      349.0
13     114.0
17      76.0
       ...  
591    205.0
596    187.0
597     88.0
600    350.0
613    133.0
Name: LoanAmount, Length: 181, dtype: float64
In [48]:
tbl_amtY = df[(df['Loan_Status'] == 'Y')]
tbl_amtY = tbl_amtY['LoanAmount'].dropna() # missing values dropped
tbl_amtY 
Out[48]:
2       66.0
3      120.0
4      141.0
5      267.0
6       95.0
       ...  
608    108.0
609     71.0
610     40.0
611    253.0
612    187.0
Name: LoanAmount, Length: 411, dtype: float64

Two-sample Z-test (LoanAmount split by Loan_Status)

$$ H_0: \mu_{N_{LoanAmt}} = \mu_{Y_{LoanAmt}}, \qquad H_1: \mu_{N_{LoanAmt}} \neq \mu_{Y_{LoanAmt}} $$

We assume the two samples are independent.

In [49]:
ztest, pval = stests.ztest(tbl_amtN,tbl_amtY,value = 0, alternative = 'two-sided')
alpha = 0.05
print("p value is " + str(pval))
if pval <= alpha:
    print('Two population means are not equal (reject H_0)')
else:
    print('Two population means are equal (H_0 holds true)')
p value is 0.36436593320314825
Two population means are equal (H_0 holds true)

Conclusion

There is no evidence that a difference in loan amount results in a lower or higher loan acceptance rate.

In [50]:
fig, ax = plt.subplots(figsize = (6,5))
df.boxplot(column = 'LoanAmount', by = 'Loan_Status', ax = ax, grid = False)
plt.show()

Continuous and Continuous

ApplicantIncome vs. LoanAmount
In [51]:
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(df['ApplicantIncome'],df['LoanAmount'])
plt.title('ApplicantIncome and LoanAmount Distribution')
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount')
plt.show()

From the above scatter plot, there appears to be a positive correlation between LoanAmount and ApplicantIncome. However, we have not yet performed any missing-value or outlier treatment, so it is too early to draw a conclusion.

CoapplicantIncome vs. LoanAmount
In [52]:
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(df['CoapplicantIncome'],df['LoanAmount'])
plt.title('CoapplicantIncome and LoanAmount Distribution')
plt.xlabel('Coapplicant Income')
plt.ylabel('Loan Amount')
plt.show()

It is hard to draw a firm conclusion from the above scatter plot, although a portion of the points suggests a positive correlation. Ideally, to double-check, we would calculate the correlation coefficient; but because the two variables have different numbers of non-missing values, we first perform the missing-value treatment before the formal correlation calculation.
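That said, as a rough preview (a one-line sketch): pandas' Series.corr drops pairs where either value is missing, so a preliminary Pearson coefficient is available even now; the estimate on properly cleaned data follows the treatment below.

In [ ]:
# Pearson correlation; pandas ignores pairs with a missing value
df['CoapplicantIncome'].corr(df['LoanAmount'])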

Section Summary (where the missing values are simply ignored)

Predictor variables observed to have a statistically significant association with the target variable are as follows:

  • Married
  • Education
  • Credit_History
  • Property_Area

Missing Value Treatment

From the previous analysis, we know that most of the variables have at least one missing value. Below I give, for each variable, my assessment of the likely cause of the missing values.

  • Gender: Most likely missing completely at random (MCAR)
  • Married: MCAR
  • Dependents: MCAR
  • Self_Employed: MCAR
  • LoanAmount: MCAR
  • Loan_Amount_Term: MCAR
  • Credit_History: Missing at random (MAR). Applicants who do not meet the credit guidelines appear to have relatively more missing values than those who do (a quick check is sketched below).
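A minimal sketch of that check: compare the Credit_History missingness rate across an observed variable such as Loan_Status (run before any rows are dropped):

In [ ]:
# MAR probe: does the missingness rate of Credit_History vary with Loan_Status?
df['Credit_History'].isnull().groupby(df['Loan_Status']).mean()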

As a rule of thumb, if the number of missing observations is less than 5% of the sample, we can drop them; and if the missing values are determined to be MCAR, deleting the affected cases is safe. With that said, the missing observations of Gender, Married, Dependents, Self_Employed, LoanAmount, and Loan_Amount_Term can be dropped.

As for Credit_History, imputation will be used to replace the missing values.

Credit_History Imputation

Before the actual imputation, we should examine whether Credit_History is associated with any of the other variables.

Credit_History vs Gender

Chi-square Test

$H_0$: There is no relationship between Credit_History and Gender.
$H_1$: Negation of $H_0$.

In [53]:
tbl_cGender = pd.crosstab(df['Credit_History'],df['Gender'])
stat, p, dof, expected = chi2_contingency(tbl_cGender)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
p value is 0.8164953298683041
Independent (H_0 holds true)

Credit_History vs Married

Chi-square Test

$H_0$: There is no relationship between Credit_History and Married.
$H_1$: Negation of $H_0$.

In [54]:
tbl_cMarried = pd.crosstab(df['Credit_History'],df['Married'])
stat, p, dof, expected = chi2_contingency(tbl_cMarried)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
p value is 0.9864309067859328
Independent (H_0 holds true)

Credit_History vs Education

Chi-square Test

$H_0$: There is no relationship between Credit_History and Education.
$H_1$: Negation of $H_0$.

In [55]:
tbl_cEd = pd.crosstab(df['Credit_History'],df['Education'])
stat, p, dof, expected = chi2_contingency(tbl_cEd)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
p value is 0.0714694601906817
Independent (H_0 holds true)

Credit_History vs Self_Employed

Chi-square Test

$H_0$: There is no relationship between Credit_History and Self_Employed.
$H_1$: Negation of $H_0$.

In [56]:
tbl_cEmp = pd.crosstab(df['Credit_History'],df['Self_Employed'])
stat, p, dof, expected = chi2_contingency(tbl_cEmp)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
p value is 0.9376796063240526
Independent (H_0 holds true)

Credit_History vs Property_Area

Chi-square Test

$H_0$: There is no relationship between Credit_History and Property_Area.
$H_1$: Negation of $H_0$.

In [57]:
tbl_cArea = pd.crosstab(df['Credit_History'],df['Property_Area'])
stat, p, dof, expected = chi2_contingency(tbl_cArea)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
p value is 0.6020428719957087
Independent (H_0 holds true)

Credit_History vs ApplicantIncome

Two-sample Z-test (ApplicantIncome split by Credit_History)

$$ H_0: \mu_{0_{income}} = \mu_{1_{income}}, \qquad H_1: \mu_{0_{income}} \neq \mu_{1_{income}} $$

where the subscripts 0 and 1 denote the two Credit_History groups.
In [58]:
df['ApplicantIncome'].groupby(df['Credit_History']).describe()
Out[58]:
count mean std min 25% 50% 75% max
Credit_History
0.0 89.0 5679.438202 9301.896041 1500.0 2787.0 3547.0 5285.0 81000.0
1.0 475.0 5426.526316 5535.393384 150.0 2908.5 3859.0 5820.0 63337.0
In [59]:
tbl_incomeN = df[(df['Credit_History'] == 0.0)]
tbl_incomeN = tbl_incomeN['ApplicantIncome']
tbl_incomeN 
Out[59]:
7      3036
17     3510
20     7660
22     2600
23     3365
       ... 
576    3087
584    2787
589    2726
597    2987
613    4583
Name: ApplicantIncome, Length: 89, dtype: int64
In [60]:
tbl_incomeY = df[(df['Credit_History'] == 1.0)]
tbl_incomeY = tbl_incomeY['ApplicantIncome']
tbl_incomeY 
Out[60]:
0      5849
1      4583
2      3000
3      2583
4      6000
       ... 
608    3232
609    2900
610    4106
611    8072
612    7583
Name: ApplicantIncome, Length: 475, dtype: int64
In [61]:
ztest, pval = stests.ztest(tbl_incomeN,tbl_incomeY,value = 0, alternative = 'two-sided')
alpha = 0.05
print("p value is " + str(pval))
if pval <= alpha:
    print('Two population means are not equal (reject H_0)')
else:
    print('Two population means are equal (H_0 holds true)')
p value is 0.7271819479920407
Two population means are equal (H_0 holds true)

Observation: none of the tests above found a significant association between Credit_History and the other predictor variables.

Imputation Using the Most Frequent Value

In [62]:
import numpy as np
from sklearn.impute import SimpleImputer
imr = SimpleImputer(missing_values = np.nan,strategy = 'most_frequent')
imr = imr.fit(df[['Credit_History']])
df['Credit_History'] = imr.transform(df[['Credit_History']]).ravel()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     614 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB

Deletion

For simplicity, we will delete every observation that still has a missing value; after the imputation above, all remaining missing values belong to the MCAR variables.

In [63]:
# if we had dropped NaN rows before the imputation,
# the sample size would have shrunk to 480;
# dropping rows with missing values after the imputation
# instead leaves 523 observations
df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 523 entries, 1 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            523 non-null    object 
 1   Gender             523 non-null    object 
 2   Married            523 non-null    object 
 3   Dependents         523 non-null    object 
 4   Education          523 non-null    object 
 5   Self_Employed      523 non-null    object 
 6   ApplicantIncome    523 non-null    int64  
 7   CoapplicantIncome  523 non-null    float64
 8   LoanAmount         523 non-null    float64
 9   Loan_Amount_Term   523 non-null    float64
 10  Credit_History     523 non-null    float64
 11  Property_Area      523 non-null    object 
 12  Loan_Status        523 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 57.2+ KB

Outlier Treatment

From the previous analysis, where the missing-value observations were ignored, the variables ApplicantIncome, CoapplicantIncome, and LoanAmount were found to have univariate outliers beyond their upper boundaries. Now let's see whether the outliers change after the missing-value treatment.

In [64]:
df[['ApplicantIncome','CoapplicantIncome','LoanAmount']].describe()
Out[64]:
ApplicantIncome CoapplicantIncome LoanAmount
count 523.000000 523.000000 523.000000
mean 5316.097514 1604.498891 145.485660
std 5549.509255 2576.549437 83.223682
min 150.000000 0.000000 9.000000
25% 2885.500000 0.000000 100.000000
50% 3850.000000 1167.000000 128.000000
75% 5825.000000 2292.500000 171.000000
max 81000.000000 33837.000000 650.000000
In [65]:
sns.boxplot(df['ApplicantIncome'])
sns.despine()
In [66]:
IQR_1 = df.ApplicantIncome.quantile(0.75) - df.ApplicantIncome.quantile(0.25)
upper_limit_1 = df.ApplicantIncome.quantile(0.75) + (IQR_1*1.5)
upper_limit_extreme_1 = df.ApplicantIncome.quantile(0.75) + (IQR_1*2)
upper_limit_1, upper_limit_extreme_1
Out[66]:
(10234.25, 11704.0)
In [67]:
outlier_count_1 = len(df[(df['ApplicantIncome'] > upper_limit_1)])
outlier_count_1 
# note: the number of outliers dropped from 50 to 43 after the missing-value treatment
Out[67]:
43

Trimming

We will use trimming (dropping the outlying rows) as our outlier treatment method.
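Since the same trim is applied three times below, the pattern can be captured in a small helper (a sketch with a hypothetical trim_upper_outliers function; the explicit per-variable cells follow):

In [ ]:
def trim_upper_outliers(frame, col, k=1.5):
    # drop rows whose value is at or above Q3 + k * IQR (computed on `frame`)
    q1, q3 = frame[col].quantile(0.25), frame[col].quantile(0.75)
    return frame[frame[col] < q3 + k * (q3 - q1)]

# applied sequentially, this mirrors the cells below:
# for col in ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']:
#     df = trim_upper_outliers(df, col)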

In [68]:
pd.set_option('mode.chained_assignment', None)
index_1 = df[(df['ApplicantIncome'] >= upper_limit_1)].index
#index_1
df.drop(index_1, inplace = True)
outlier_ct_1 = len(df[(df['ApplicantIncome'] > upper_limit_1)])
outlier_ct_1
Out[68]:
0
In [69]:
sns.boxplot(df['CoapplicantIncome'])
sns.despine()
In [70]:
IQR_2 = df.CoapplicantIncome.quantile(0.75) - df.CoapplicantIncome.quantile(0.25)
upper_limit_2 = df.CoapplicantIncome.quantile(0.75) + (IQR_2*1.5)
upper_limit_extreme_2 = df.CoapplicantIncome.quantile(0.75) + (IQR_2*2)
upper_limit_2, upper_limit_extreme_2
Out[70]:
(5842.5, 7011.0)
In [71]:
outlier_count_2 = len(df[(df['CoapplicantIncome'] > upper_limit_2)])
outlier_count_2 
# the number of outliers fell from 18 to 15 after the missing-value treatment and the ApplicantIncome trimming
Out[71]:
15
In [72]:
#pd.set_option('mode.chained_assignment', None)
index_2 = df[(df['CoapplicantIncome'] >= upper_limit_2)].index
#index_2
df.drop(index_2, inplace = True)
outlier_ct_2 = len(df[(df['CoapplicantIncome'] > upper_limit_2)])
outlier_ct_2
Out[72]:
0
In [73]:
sns.boxplot(df['LoanAmount'])
sns.despine()
In [74]:
IQR_3 = df.LoanAmount.quantile(0.75) - df.LoanAmount.quantile(0.25)
upper_limit_3 = df.LoanAmount.quantile(0.75) + (IQR_3*1.5)
upper_limit_extreme_3 = df.LoanAmount.quantile(0.75) + (IQR_3*2)
upper_limit_3, upper_limit_extreme_3
Out[74]:
(242.5, 271.0)
In [75]:
outlier_count_3 = len(df[(df['LoanAmount'] > upper_limit_3)])
outlier_count_3 
# the number of outliers fell from 39 to 23 after the preceding treatments
Out[75]:
23
In [76]:
#pd.set_option('mode.chained_assignment', None)
index_3 = df[(df['LoanAmount'] >= upper_limit_3)].index
#index_3
df.drop(index_3, inplace = True)
outlier_ct_3 = len(df[(df['LoanAmount'] > upper_limit_3)])
outlier_ct_3
Out[76]:
0

Variable Correlations After Data Cleaning

In [77]:
# encode categorical values as numeric codes
code_numeric = {'Male':1, 'Female':2,
               'Yes': 1, 'No':2,
                'Graduate':1, 'Not Graduate':2,
                'Urban':1, 'Semiurban':2, 'Rural':3,
                'Y':1, 'N':0,
                '3+':3 }
df = df.applymap(lambda i: code_numeric.get(i) if i in code_numeric else i)
df['Dependents'] = pd.to_numeric(df.Dependents)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 442 entries, 1 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            442 non-null    object 
 1   Gender             442 non-null    int64  
 2   Married            442 non-null    int64  
 3   Dependents         442 non-null    int64  
 4   Education          442 non-null    int64  
 5   Self_Employed      442 non-null    int64  
 6   ApplicantIncome    442 non-null    int64  
 7   CoapplicantIncome  442 non-null    float64
 8   LoanAmount         442 non-null    float64
 9   Loan_Amount_Term   442 non-null    float64
 10  Credit_History     442 non-null    float64
 11  Property_Area      442 non-null    int64  
 12  Loan_Status        442 non-null    int64  
dtypes: float64(4), int64(8), object(1)
memory usage: 48.3+ KB
In [78]:
matrix = np.triu(df.corr())
fig, ax = plt.subplots(figsize = (10,10))
sns.heatmap(df.corr(), annot = True, mask = matrix, linewidths = .5, ax = ax)
Out[78]:
<AxesSubplot:>

Conclusion

Depending on the methods chosen for missing-value and outlier treatment, the resulting data loss may affect the prediction models built to solve the original problem. This project only explores the dataset and describes its characteristics through data visualization and statistical techniques. Below is how strongly each variable correlates with the target variable.

In [79]:
# extract the Loan_Status column (index 11) of the upper-triangular correlation matrix
m = matrix[:,11]
m = pd.DataFrame(m)
m1 = np.transpose(m)
In [80]:
# label with the variable names (Loan_ID is excluded from df.corr())
m1.columns = df.columns[1:]
m2 = np.transpose(m1)
new_col = ['corr_to_Loan_Status']
m2.columns = new_col
In [81]:
# sort predictor variables by the absolute strength of their correlation with the target variable
m2['corr_to_Loan_Status'] = m2['corr_to_Loan_Status'].abs()
m2.sort_values(by = new_col, ascending = False)
Out[81]:
corr_to_Loan_Status
Loan_Status 1.000000
Credit_History 0.541256
Married 0.110421
Education 0.099422
Gender 0.090109
Property_Area 0.064360
CoapplicantIncome 0.055412
Self_Employed 0.038612
ApplicantIncome 0.030803
Dependents 0.014669
Loan_Amount_Term 0.014219
LoanAmount 0.013060