This project focuses on Exploratory Data Analysis (EDA), including data exploration, data preparation, and data cleaning. The dataset of interest is the training set of the Loan Prediction Problem Dataset. The original problem is to predict customers' loan eligibility based on their application details. This project covers neither the loan prediction steps nor the machine learning algorithms.
Data Dictionary:
import pandas as pd
import seaborn as sns
from scipy.stats import chi2_contingency
from statsmodels.stats import weightstats as stests
import matplotlib.pyplot as plt
df = pd.read_csv("C:/Users/Elle/Desktop/jupyter_folder/datasets/loan_train.csv")
df.head()
Although we are not going to solve the original Loan Prediction problem here, it is generally helpful to keep the original problem in mind while performing the exploratory analysis. The types of variables can thus be defined as follows:
(Possible) Predictor Variable:
Target Variable:
# the dataset comprises 614 observations and 13 characteristics
df.shape
df.info()
The dataset can be separated into the following two variable categories:
Categorical:
Continuous:
freq_tbl1 = df.groupby(['Gender','Married','Education','Self_Employed'])
freq_tbl1.size()
freq_tbl2 = df.groupby(['Credit_History','Property_Area','Loan_Amount_Term'])
freq_tbl2.size()
df.Loan_Status.value_counts()
There are more loans approved than rejected.
df[['ApplicantIncome','CoapplicantIncome','LoanAmount']].describe()
Observations:
These two observations suggest that there are missing values and outliers in our dataset; treatments should be applied in order to avoid building a biased predictive model.
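Before any treatment, a quick per-column count of missing values (a minimal sketch) confirms the first observation:
# count missing values in each column
df.isnull().sum()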
# indeed we found there are quite a few outliers
sns.boxplot(df['ApplicantIncome'])
sns.despine()
IQR_app = df.ApplicantIncome.quantile(0.75) - df.ApplicantIncome.quantile(0.25)
upper_limit_app = df.ApplicantIncome.quantile(0.75) + (IQR_app*1.5)
upper_limit_extreme_app = df.ApplicantIncome.quantile(0.75) + (IQR_app*2)
upper_limit_app, upper_limit_extreme_app
outlier_count_app = len(df[(df['ApplicantIncome'] > upper_limit_app)])
outlier_count_app
# the ApplicantIncome distribution is right skewed
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
ax.hist(df['ApplicantIncome'], bins=30)
plt.title('ApplicantIncome Distribution')
plt.xlabel('Applicant Income')
plt.ylabel('Frequency')
plt.show()
sns.boxplot(df['CoapplicantIncome'])
sns.despine()
IQR_coapp = df.CoapplicantIncome.quantile(0.75) - df.CoapplicantIncome.quantile(0.25)
upper_limit_coapp = df.CoapplicantIncome.quantile(0.75) + (IQR_coapp*1.5)
upper_limit_extreme_coapp = df.CoapplicantIncome.quantile(0.75) + (IQR_coapp*2)
upper_limit_coapp, upper_limit_extreme_coapp
outlier_count_coapp = len(df[(df['CoapplicantIncome'] > upper_limit_coapp)])
outlier_count_coapp
sns.boxplot(df['LoanAmount'])
sns.despine()
IQR_loanAmt = df.LoanAmount.quantile(0.75) - df.LoanAmount.quantile(0.25)
upper_limit_loanAmt = df.LoanAmount.quantile(0.75) + (IQR_loanAmt*1.5)
upper_limit_extreme_loanAmt = df.LoanAmount.quantile(0.75) + (IQR_loanAmt*2)
upper_limit_loanAmt, upper_limit_extreme_loanAmt
outlier_count_loanAmt = len(df[(df['LoanAmount'] > upper_limit_loanAmt)])
outlier_count_loanAmt
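Since the IQR computation above is repeated verbatim for each numeric column, a small helper (a sketch, not part of the original notebook; the name iqr_upper_limits is hypothetical) could consolidate it:
def iqr_upper_limits(s, k = 1.5, k_extreme = 2.0):
    # IQR-based upper limits and outlier count for a numeric Series (NaNs are skipped)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    upper = q3 + k * iqr
    upper_extreme = q3 + k_extreme * iqr
    return upper, upper_extreme, int((s > upper).sum())

iqr_upper_limits(df['CoapplicantIncome'])  # should match the values computed above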
# the LoanAmount distribution is right skewed
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
ax.hist(df['LoanAmount'], bins=100)
plt.title('Loan Amount Distribution')
plt.xlabel('Loan Amount')
plt.ylabel('Frequency')
plt.show()
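To quantify the right skew visible in the histograms (a quick sketch; positive skewness indicates a longer right tail):
# sample skewness of the three continuous variables (NaNs skipped)
df[['ApplicantIncome','CoapplicantIncome','LoanAmount']].skew()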
Variables with missing values:
Variables with outliers:
where "Maximum" refers to 75th Percentile + 1.5$$(Interqartile Range)*
tbl = pd.crosstab(index = df['Loan_Status'], columns = df['Gender'])
tbl.index = ['Not Approved','Approved']
tbl
Pearson's Chi-square Test
The chi-square test statistic for a test of independence of two categorical variables is as follows: $$ \chi^2 = \sum \frac{(O - E)^2}{E} $$ where $O$ is the observed frequency and $E$ is the expected frequency.
$ \\ H_0: \:There \:is \:no \:relationship \:between \:Loan\:Status \:and \:Gender \\H_1: \:There \:is \:a \:significant \:relation \:between \:the \:two $
Calculated chi-square value (original frequency table, excluding missing values)
The calculated $\chi^2$ value is 0.2369751 with 1 degree of freedom. The critical value of the $\chi^2$ distribution with 1 degree of freedom is 3.841.
Hence:
$$ \chi^2_{critical} \ge \chi^2_{calculated} $$ Therefore we fail to reject $H_0$.
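The hand calculation above can be reproduced directly from the contingency table (a minimal sketch; note that scipy's chi2_contingency applies the Yates continuity correction by default on 2x2 tables, so its statistic can differ slightly from this uncorrected value):
# expected cell counts under independence: row total * column total / grand total
obs = tbl.to_numpy()
exp = obs.sum(axis = 1, keepdims = True) * obs.sum(axis = 0, keepdims = True) / obs.sum()
chi2_manual = ((obs - exp)**2 / exp).sum()
chi2_manual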
# performing the test using Python
# frequency table without missing values
stat, p, dof, expected = chi2_contingency(tbl)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
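Equivalently, the decision can be made by comparing the test statistic against the critical value used in the hand calculation (a sketch; scipy's chi-square distribution is imported under an alias to avoid shadowing chi2_contingency):
from scipy.stats import chi2 as chi2_dist
critical = chi2_dist.ppf(1 - alpha, dof)  # ~3.841 for dof = 1
print("critical value is " + str(critical))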
# what would the table look like if we included the counts of missing-value observations?
tbl2 = tbl.copy()  # copy so that modifying tbl2 does not alter tbl
df_new = df[df['Gender'].isnull()]
tbl_tp = df_new.groupby('Loan_Status')
tbl_tp.size()
tbl2['NaN_val'] = [5,8]
tbl2
tbl2.plot.bar(xlabel = 'Loan Status', rot = 0)
The bar plot above does not provide much evidence of a significant relationship between Loan_Status and Gender.
Calculated chi-square value (modified frequency table, including counts of missing values)
The calculated $\chi^2$ value is 5.66887 with 2 degrees of freedom. The critical value of the $\chi^2$ distribution with 2 degrees of freedom is 5.991.
Hence:
$$ \chi^2_{critical} \ge \chi^2_{calculated} $$ Therefore we again fail to reject $H_0$.
# performing the test using Python
# frequency table with missing values
stat, p, dof, expected = chi2_contingency(tbl2)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
Both $\chi^2$ tests fail to reject $H_0$, suggesting that there is no significant relationship between Loan_Status and Gender.
tbl_marrd = pd.crosstab(index = df['Loan_Status'], columns = df['Married'])
tbl_marrd.index = ['Not Approved','Approved']
tbl_marrd.columns = ['Not Married','Married']
tbl_marrd
Chi-square Test
$ H_0: \:There \:is \:no \:relationship \:between \:variables\:Loan\:Status\:and\:Married \\H_1:\: Negation\:of\:H_0 $
stat, p, dof, expected = chi2_contingency(tbl_marrd)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
tbl_ed = pd.crosstab(df['Loan_Status'],df['Education'])
tbl_ed.index = ['Not Approved','Approved']
tbl_ed
tbl_ed.plot.bar(xlabel = 'Loan Status', rot = 0)
Chi-square Test
$ H_0: \:There \:is \:no \:relationship \:between \:Loan\:Status\:and\:Education \\H_1:\: Negation\:of\:H_0 $
stat, p, dof, expected = chi2_contingency(tbl_ed)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
The p-value above indicates that the relationship between Loan_Status and Education is statistically significant at the 95% confidence level.
tbl_emp = pd.crosstab(index = df['Loan_Status'], columns = df['Self_Employed'])
tbl_emp.index = ['Not Approved','Approved']
tbl_emp
Self_Employed has 32 missing values.
tbl_emp.plot.bar(xlabel = 'Loan Status',rot = 0)
Chi-square Test
$ H_0: \:There \:is \:no \:relationship \:between \:Loan\:Status\:and\:Self\_Employed \\H_1:\: Negation\:of\:H_0 $
stat, p, dof, expected = chi2_contingency(tbl_emp)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
The chi-square test suggests that there is no significant relationship between Loan_Status and Self_Employed.
tbl_term = pd.crosstab(index = df['Loan_Status'], columns = df['Loan_Amount_Term'])
tbl_term.index = ['Not Approved','Approved']
tbl_term
stat, p, dof, expected = chi2_contingency(tbl_term)
# note: several Loan_Amount_Term categories are sparse, so the chi-square
# expected-count assumption (>= 5 per cell) may not hold here
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
tbl_crt = pd.crosstab(index = df['Loan_Status'], columns = df['Credit_History'])
tbl_crt.index = ['Not Approved','Approved']
tbl_crt.columns = ['Guidelines Not Met', 'Guidelines Met']
tbl_crt
tbl_crt.plot.bar(xlabel = 'Loan_Status', rot = 0)
Chi-square Test
$ H_0: \:There \:is \:no \:relationship \:between \:Loan\:Status\:and\:Credit\_History \\H_1:\: Negation\:of\:H_0 $
stat, p, dof, expected = chi2_contingency(tbl_crt)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
tbl_area = pd.crosstab(index = df['Loan_Status'], columns = df['Property_Area'])
tbl_area.index = ['Not Approved','Approved']
tbl_area
tbl_area.plot.bar(xlabel = 'Loan Status', rot = 0)
Chi-square Test
$ H_0: \:There \:is \:no \:relationship \:between \:Loan\:Status\:and\:Property\_Area \\H_1:\: Negation\:of\:H_0 $
stat, p, dof, expected = chi2_contingency(tbl_area)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
df['ApplicantIncome'].groupby(df['Loan_Status']).describe()
# data set of Loan_Status Not Approved
tbl_appN = df[(df['Loan_Status'] == 'N')]
tbl_appN = tbl_appN['ApplicantIncome']
tbl_appN
# data set of Loan_Status Approved
tbl_appY = df[(df['Loan_Status'] == 'Y')]
tbl_appY = tbl_appY['ApplicantIncome']
tbl_appY
Two-sample Z-Test (ApplicantIncome split by Loan_Status)
$$ H_0: \:\mu_{N_{income}} =\:\mu_{Y_{income}} \\H_1:\: Negation\:of\:H_0 $$
Assuming the two samples are independent.
ztest, pval = stests.ztest(tbl_appN,tbl_appY,value = 0, alternative = 'two-sided')
alpha = 0.05
print("p value is " + str(pval))
if pval <= alpha:
    print('Two population means are not equal (reject H_0)')
else:
    print('Two population means are equal (H_0 holds true)')
Conclusion
There is no evidence that differences in applicant income have an impact on loan eligibility.
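As a cross-check (a sketch, not part of the original analysis), Welch's t-test, which does not assume equal variances, should give a similar p-value at these sample sizes:
from scipy.stats import ttest_ind
tstat, tpval = ttest_ind(tbl_appN, tbl_appY, equal_var = False)  # Welch's t-test
print("p value is " + str(tpval))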
fig, ax = plt.subplots(figsize = (6,5))
df.boxplot(column = 'ApplicantIncome', by = 'Loan_Status', ax = ax, grid = False)
plt.show()
df['LoanAmount'].groupby(df['Loan_Status']).describe()
tbl_amtN = df[(df['Loan_Status'] == 'N')]
tbl_amtN = tbl_amtN['LoanAmount'].dropna() # we know there are missing values
tbl_amtN
tbl_amtY = df[(df['Loan_Status'] == 'Y')]
tbl_amtY = tbl_amtY['LoanAmount'].dropna() # missing values dropped
tbl_amtY
Two-sample Z-Test (LoanAmount split by Loan_Status)
$$ H_0: \:\mu_{N_{LoanAmt}} =\:\mu_{Y_{LoanAmt}} \\H_1:\: Negation\:of\:H_0 $$
Assuming the two samples are independent.
ztest, pval = stests.ztest(tbl_amtN,tbl_amtY,value = 0, alternative = 'two-sided')
alpha = 0.05
print("p value is " + str(pval))
if pval <= alpha:
    print('Two population means are not equal (reject H_0)')
else:
    print('Two population means are equal (H_0 holds true)')
Conclusion
There is no evidence that differences in loan amount result in a lower or higher loan approval rate.
fig, ax = plt.subplots(figsize = (6,5))
df.boxplot(column = 'LoanAmount', by = 'Loan_Status', ax = ax, grid = False)
plt.show()
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(df['ApplicantIncome'],df['LoanAmount'])
plt.title('ApplicantIncome and LoanAmount Distribution')
plt.xlabel('Applicant Income')
plt.ylabel('Loan Amount')
plt.show()
From the scatter plot above, it looks like there is a positive correlation between LoanAmount and ApplicantIncome. However, we have not yet performed any missing-value or outlier treatment, so it is too early to draw a conclusion.
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(df['CoapplicantIncome'],df['LoanAmount'])
plt.title('CoapplicantIncome and LoanAmount Distribution')
plt.xlabel('Coapplicant Income')
plt.ylabel('Loan Amount')
plt.show()
It is hard to draw any conclusion from the scatter plot above, although a portion of the points suggests a positive correlation. Ideally, to double-check, we would calculate the correlation coefficient; but because the two variables differ in shape (numbers of non-missing values), we must first treat the missing values before computing the correlation.
Predictor variables that are observed to have a statistically significant association with the target variable are as follows:
From the previous analysis, we know most of the variables have at least one missing value. Below, we examine what, in my opinion, is the likely cause of the missing values for each variable.
Statistically speaking, if the number of missing observations is less than 5% of the sample, we can drop them. In addition, if the missing values are determined to be MCAR (missing completely at random), we can safely delete those cases. With that said, missing observations of Gender, Married, Dependents, Self_Employed, LoanAmount, and Loan_Amount_Term can be dropped.
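To see which variables fall under that 5% rule of thumb, the fraction of missing values per variable can be computed directly (a quick sketch):
# fraction of missing values per column, largest first
missing_frac = df.isnull().mean().sort_values(ascending = False)
missing_frac[missing_frac > 0]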
As for Credit_History, imputation will be conducted to replace the missing ones.
Before the actual imputation, we should examine if there's any correlation between Credit_History and other variables.
Chi-square Test
$ H_0: \:There \:is \:no \:relationship \:between \:variables\:Credit\_History\:and\:Gender \\H_1:\: Negation\:of\:H_0 $
tbl_cGender = pd.crosstab(df['Credit_History'],df['Gender'])
stat, p, dof, expected = chi2_contingency(tbl_cGender)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
Chi-square Test
$ H_0: \:There \:is \:no \:relationship \:between \:variables\:Credit\_History\:and\:Married \\H_1:\: Negation\:of\:H_0 $
tbl_cMarried = pd.crosstab(df['Credit_History'],df['Married'])
stat, p, dof, expected = chi2_contingency(tbl_cMarried)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
Chi-square Test
$ H_0: \:There \:is \:no \:relationship \:between \:variables\:Credit\_History\:and\:Education \\H_1:\: Negation\:of\:H_0 $
tbl_cEd = pd.crosstab(df['Credit_History'],df['Education'])
stat, p, dof, expected = chi2_contingency(tbl_cEd)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
Chi-square Test
$ H_0: \:There \:is \:no \:relationship \:between \:variables\:Credit\_History\:and\:Self\_Employed \\H_1:\: Negation\:of\:H_0 $
tbl_cEmp = pd.crosstab(df['Credit_History'],df['Self_Employed'])
stat, p, dof, expected = chi2_contingency(tbl_cEmp)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
Chi-square Test
$ H_0: \:There \:is \:no \:relationship \:between \:variables\:Credit\_History\:and\:Property\_Area \\H_1:\: Negation\:of\:H_0 $
tbl_cArea = pd.crosstab(df['Credit_History'],df['Property_Area'])
stat, p, dof, expected = chi2_contingency(tbl_cArea)
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H_0)')
else:
    print('Independent (H_0 holds true)')
Two-sample Z-Test (ApplicantIncome split by Credit_History)
$$ H_0: \:\mu_{0_{income}} =\:\mu_{1_{income}} \\H_1:\: Negation\:of\:H_0 $$
Assuming the two samples are independent.
df['ApplicantIncome'].groupby(df['Credit_History']).describe()
tbl_incomeN = df[(df['Credit_History'] == 0.0)]
tbl_incomeN = tbl_incomeN['ApplicantIncome']
tbl_incomeN
tbl_incomeY = df[(df['Credit_History'] == 1.0)]
tbl_incomeY = tbl_incomeY['ApplicantIncome']
tbl_incomeY
ztest, pval = stests.ztest(tbl_incomeN,tbl_incomeY,value = 0, alternative = 'two-sided')
alpha = 0.05
print("p value is " + str(pval))
if pval <= alpha:
    print('Two population means are not equal (reject H_0)')
else:
    print('Two population means are equal (H_0 holds true)')
Observation: none of the above tests show a significant association between Credit_History and the other predictor variables.
import numpy as np
from sklearn.impute import SimpleImputer
imr = SimpleImputer(missing_values = np.nan,strategy = 'most_frequent')
imr = imr.fit(df[['Credit_History']])
df['Credit_History'] = imr.transform(df[['Credit_History']]).ravel()
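For reference, the same most-frequent (mode) imputation can be done with pandas alone; a sketch, shown commented out since the SimpleImputer call above has already filled the column:
# equivalent pandas-only mode imputation:
# df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])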
df.info()
For simplicity, we will delete observations with a missing value in any of the remaining variables, since the missingness is assumed to be MCAR.
# if we drop the NaN rows before imputation,
# the sample size shrinks to 480;
# if we instead drop rows with missing values after imputing Credit_History,
# the sample size is reduced to 523
df = df.dropna()
df.info()
From the previous analysis, where we ignored the missing-value observations, the variables ApplicantIncome, CoapplicantIncome, and LoanAmount were found to have univariate outliers above the upper boundary. Now let's see whether the outliers change after our missing-value treatment.
df[['ApplicantIncome','CoapplicantIncome','LoanAmount']].describe()
sns.boxplot(df['ApplicantIncome'])
sns.despine()
IQR_1 = df.ApplicantIncome.quantile(0.75) - df.ApplicantIncome.quantile(0.25)
upper_limit_1 = df.ApplicantIncome.quantile(0.75) + (IQR_1*1.5)
upper_limit_extreme_1 = df.ApplicantIncome.quantile(0.75) + (IQR_1*2)
upper_limit_1, upper_limit_extreme_1
outlier_count_1 = len(df[(df['ApplicantIncome'] > upper_limit_1)])
outlier_count_1
# notice the number of outliers did not change after the missing-value treatment
We will use trimming (dropping the outlying observations) as our outlier treatment method here.
pd.set_option('mode.chained_assignment', None)
index_1 = df[(df['ApplicantIncome'] >= upper_limit_1)].index
#index_1
df.drop(index_1, inplace = True)
outlier_ct_1 = len(df[(df['ApplicantIncome'] > upper_limit_1)])
outlier_ct_1
sns.boxplot(df['CoapplicantIncome'])
sns.despine()
IQR_2 = df.CoapplicantIncome.quantile(0.75) - df.CoapplicantIncome.quantile(0.25)
upper_limit_2 = df.CoapplicantIncome.quantile(0.75) + (IQR_2*1.5)
upper_limit_extreme_2 = df.CoapplicantIncome.quantile(0.75) + (IQR_2*2)
upper_limit_2, upper_limit_extreme_2
outlier_count_2 = len(df[(df['CoapplicantIncome'] > upper_limit_2)])
outlier_count_2
# number of outliers reduced from 17 to 16 after missing-value treatment
#pd.set_option('mode.chained_assignment', None)
index_2 = df[(df['CoapplicantIncome'] >= upper_limit_2)].index
#index_2
df.drop(index_2, inplace = True)
outlier_ct_2 = len(df[(df['CoapplicantIncome'] > upper_limit_2)])
outlier_ct_2
sns.boxplot(df['LoanAmount'])
sns.despine()
IQR_3 = df.LoanAmount.quantile(0.75) - df.LoanAmount.quantile(0.25)
upper_limit_3 = df.LoanAmount.quantile(0.75) + (IQR_3*1.5)
upper_limit_extreme_3 = df.LoanAmount.quantile(0.75) + (IQR_3*2)
upper_limit_3, upper_limit_extreme_3
outlier_count_3 = len(df[(df['LoanAmount'] > upper_limit_3)])
outlier_count_3
# number of outliers reduced from 30 to 23 after missing-value treatment
#pd.set_option('mode.chained_assignment', None)
index_3 = df[(df['LoanAmount'] >= upper_limit_3)].index
#index_3
df.drop(index_3, inplace = True)
outlier_ct_3 = len(df[(df['LoanAmount'] > upper_limit_3)])
outlier_ct_3
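The trimming step is likewise repeated for three columns; a small helper (a sketch under the same 1.5*IQR rule; the name trim_above is hypothetical, and it uses a strict > cutoff where the code above drops values >= the limit) could consolidate it:
def trim_above(frame, col):
    # drop rows where frame[col] exceeds its IQR-based upper limit
    q1, q3 = frame[col].quantile(0.25), frame[col].quantile(0.75)
    upper = q3 + 1.5 * (q3 - q1)
    return frame[~(frame[col] > upper)]

# hypothetical usage, recomputing the limit after each trim as above:
# for c in ['ApplicantIncome','CoapplicantIncome','LoanAmount']:
#     df = trim_above(df, c)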
# encoding categorical values to numeric data types
code_numeric = {'Male':1, 'Female':2,
                'Yes':1, 'No':2,
                'Graduate':1, 'Not Graduate':2,
                'Urban':1, 'Semiurban':2, 'Rural':3,
                'Y':1, 'N':0,
                '3+':3}
df = df.applymap(lambda i: code_numeric.get(i, i))
df['Dependents'] = pd.to_numeric(df.Dependents)
df.info()
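Note: in newer pandas (2.1+), DataFrame.applymap is deprecated in favor of DataFrame.map; the equivalent call would be df.map(lambda i: code_numeric.get(i, i)).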
matrix = np.triu(df.corr())
fig, ax = plt.subplots(figsize = (10,10))
sns.heatmap(df.corr(), annot = True, mask = matrix, linewidths = .5, ax = ax)
Depending on the missing-value and/or outlier treatment methods chosen, the resulting data loss might affect any prediction model built to solve the original problem. This project only explores the dataset and describes its characteristics through data visualization and statistical techniques. Below we show how strongly each variable correlates with the target variable.
m = matrix[:,11]               # correlations with Loan_Status (12th numeric column)
m = pd.DataFrame(m)
m1 = np.transpose(m)
m1.columns = (df.columns[1:])  # skip Loan_ID, which is excluded from df.corr()
m2 = np.transpose(m1)
new_col = ['corr_to_Loan_Status']
m2.columns = new_col
# sort predictor variables by the strength of their correlation with the Target variable
m2['corr_to_Loan_Status'] = m2['corr_to_Loan_Status'].abs()
m2.sort_values(by = new_col, ascending = False)
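For reference, the same ranking can be obtained more directly from the correlation matrix (a sketch assuming the encoded df above; the self-correlation of Loan_Status is dropped):
# absolute correlation of each predictor with the target, sorted
corr_to_target = df.corr()['Loan_Status'].drop('Loan_Status').abs()
corr_to_target.sort_values(ascending = False)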