
Kenji Sato

Harvard University CS109A, Summer 2018
Kenneth Brown - David Gil García - Nikat Patel

```python
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import re
import scipy as sci
import statsmodels.api as sm
from matplotlib import cm
from statsmodels.api import OLS
from pandas.plotting import scatter_matrix
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, LogisticRegression
%matplotlib inline
```

After reviewing all the available datasets, we notice that the features change from 2015 onward and appear to be more informative than in previous years.

Each row corresponds to a snapshot of a loan at the end of the period and carries information on both the loan itself and the borrower. Since LendingClub does not provide a unique id for loans or borrowers, it's not possible to join several periods together to increase the amount of data: we'd be repeating information on the same loans at different times, which would distort the outcome of the study.

```python
# Remove the % sign from 'int_rate' and cast it to a float,
# mapping missing values to 0
loan_df['int_rate'] = (loan_df['int_rate'].astype(str)
                                          .str.rstrip('%')
                                          .replace('nan', '0'))
loan_df['int_rate'] = pd.to_numeric(loan_df['int_rate'])
```

After loading and cleaning the data we start with simple visualizations, groupings and descriptive statistics of the dataset by different features, to get a first glance at the data. Lending Club grades loans by their risk, which translates into higher-risk loans paying higher interest and vice versa.
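The percent-stripping step can be seen in miniature on a hypothetical series (the values here are made up, not from the dataset): strip the trailing `%`, map missing entries to `0`, and coerce to numeric.

```python
import pandas as pd

# Hypothetical sample of raw 'int_rate' strings
rates = pd.Series(['13.56%', '7.97%', 'nan'])
cleaned = (rates.str.rstrip('%')       # drop the trailing percent sign
                .replace('nan', '0')) # treat missing entries as 0
cleaned = pd.to_numeric(cleaned)
print(cleaned.tolist())  # → [13.56, 7.97, 0.0]
```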

Understanding this, and considering the goal of the analysis, we decide to work from an initial hypothesis: if we can understand how Lending Club grades a loan, we should be able to improve on their grading criteria, "regrade" loans, and invest in higher-return loans that are less risky than LC graded them to be.

To achieve this we will work on three strategies:

- Build a model that accurately grades loans by Lending Club's standards
- Build a model that accurately predicts the likelihood of default
- Combine Lending Club's data with macroeconomic indicators that can give us exogenous confounding variables, potentially increasing the predictive accuracy of both models and thus our competitive advantage

```python
loan_df.groupby(['loan_status', 'grade']).agg({
    'loan_status': np.size,
    'loan_amnt': np.mean,
    'int_rate': np.mean
})
```

A first view of the distribution of loans by status shows that there is no evident logic as to how a loan will come to term just by looking at its grade, amount or interest rate.
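The groupby/agg pattern above can be sketched on a hypothetical toy table (made-up values, and modern pandas named aggregation instead of the dict form): count loans and average the amount and rate per (status, grade) pair.

```python
import pandas as pd

# Hypothetical toy loans, not from the real dataset
toy = pd.DataFrame({
    'loan_status': ['Fully Paid', 'Fully Paid', 'Charged Off'],
    'grade':       ['A', 'A', 'C'],
    'loan_amnt':   [1000, 3000, 5000],
    'int_rate':    [5.0, 6.0, 15.0],
})
summary = toy.groupby(['loan_status', 'grade']).agg(
    n=('loan_amnt', 'size'),          # number of loans in the group
    mean_amnt=('loan_amnt', 'mean'),  # average loan amount
    mean_rate=('int_rate', 'mean'),   # average interest rate
)
print(summary)
```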

```python
# Try to find the variables that LC considers when assigning a grade
gradingMat = loan_df[['grade', 'loan_amnt', 'annual_inc', 'term', 'int_rate',
                      'delinq_2yrs', 'mths_since_last_delinq', 'emp_length',
                      'home_ownership', 'pub_rec_bankruptcies', 'tax_liens']]
gradingMatDumm = pd.get_dummies(gradingMat,
                                columns=['grade', 'term', 'emp_length', 'home_ownership'])

fig, ax = plt.subplots(figsize=(10, 10))
corr = gradingMatDumm.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool),
            cmap=sns.diverging_palette(10, 200, as_cmap=True, center='light'),
            square=True, ax=ax)
ax.set_title('Lending Club Grading Criteria', fontsize=18)
```

We can see that grade is related to loan amount and term. Grade A seems to consider more variables, such as delinquency in the past two years and bankruptcies.
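Since `.corr()` only works on numeric columns, the categorical features have to be one-hot encoded first; here is a minimal, hypothetical illustration of what `pd.get_dummies` does to a categorical column before the heatmap is drawn.

```python
import pandas as pd

# Hypothetical two-column frame with one categorical feature
df = pd.DataFrame({'grade': ['A', 'B', 'A'], 'loan_amnt': [1000, 2000, 3000]})
dummies = pd.get_dummies(df, columns=['grade'])

# Non-categorical columns stay first; each grade becomes an indicator column
print(list(dummies.columns))  # → ['loan_amnt', 'grade_A', 'grade_B']
```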

We'll now explore whether any features look like good predictors of a loan ending in default, defining default as: loans with hardship flag != N, loans for which late recovery fees have been collected, loans with percentage of never-delinquent != 100, loans with delinquent amounts > 0, or loans with status "charged off".

```python
# We create a new feature flagging what we'll be considering a default,
# which we'll use as an outcome we'll want to avoid
```

```python
loan_df['risk'] = 0
badLoan = ['Charged Off', 'Late (31-120 days)', 'Late (16-30 days)', 'In Grace Period']
loan_df.loc[(loan_df['delinq_amnt'] > 0) |
            (loan_df['pct_tl_nvr_dlq'] != 100) |
            (loan_df['total_rec_late_fee'] > 1) |
            (loan_df['delinq_2yrs'] > 0) |
            (loan_df['pub_rec_bankruptcies'] > 0) |
            (loan_df['debt_settlement_flag'] != 'N') |
            loan_df['loan_status'].isin(badLoan), 'risk'] = 1

predDefault = loan_df[['risk', 'grade', 'loan_amnt', 'annual_inc', 'term', 'int_rate',
                       'emp_length', 'tax_liens', 'total_acc', 'total_cu_tl',
                       'hardship_loan_status', 'num_sats', 'open_rv_24m', 'pub_rec',
                       'tot_coll_amt', 'tot_hi_cred_lim', 'total_bal_ex_mort']]
predDefault = pd.get_dummies(predDefault, columns=['grade', 'term'], drop_first=False)

f, ax = plt.subplots(figsize=(10, 10))
corr = predDefault.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool),
            cmap=sns.diverging_palette(10, 200, as_cmap=True, center='light'),
            square=True, ax=ax)
```

We see some correlations between the new risk feature and the other features that may be worth exploring further.
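The risk flag is built with chained boolean masks and `.loc` assignment; here is the same pattern in miniature on a hypothetical two-condition frame (made-up values, not the real dataset).

```python
import pandas as pd

# Hypothetical toy loans with two of the risk conditions
toy = pd.DataFrame({'delinq_amnt': [0, 250, 0],
                    'loan_status': ['Fully Paid', 'Current', 'Charged Off']})
bad = ['Charged Off', 'Late (31-120 days)']

toy['risk'] = 0
# Flag a row if ANY condition holds (note the parentheses around each mask)
toy.loc[(toy['delinq_amnt'] > 0) | toy['loan_status'].isin(bad), 'risk'] = 1
print(toy['risk'].tolist())  # → [0, 1, 1]
```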

We see that the variables by which Lending Club seems to grade loans do indeed have a potential effect on risk (such as term or amount of the loan), but we also see others that they don't seem to weigh as heavily, such as having a tax lien or derogatory public records. It's interesting to point out that better grades, as assigned by Lending Club, don't necessarily correspond to less risk of default, as seen by "charged off" having a negative correlation with grade G and a positive one with better grades.

In favor of Lending Club's grading system, higher-interest loans do seem to carry intrinsically higher risk, at least in this rough preliminary analysis.

```python
loans_df = full_loan_stats.copy()
loans_df = loans_df.select_dtypes(include=['float64']).join(loans_df[target_col])

def get_columns_to_drop(df):
    """Return a list of columns from df that are all NaN"""
    columns_to_drop = []
    for col in df.columns:
        unique_rows = df[col].unique()
        if (unique_rows.size == 1
                and not isinstance(unique_rows[0], str)
                and np.isnan(unique_rows[0])):
            columns_to_drop.append(col)
    return columns_to_drop

# Drop columns that contain only NaN values
loans_df = loans_df.drop(get_columns_to_drop(loans_df), axis=1)
loans_df.shape
```

    (421095, 55)

After cleaning, loans_df went from 145 columns down to 55.
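For reference, pandas has a built-in that expresses the same all-NaN-column drop in one line; this is a sketch on a hypothetical toy frame, not the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column 'b' is entirely NaN, 'c' only partially
df = pd.DataFrame({'a': [1.0, 2.0],
                   'b': [np.nan, np.nan],
                   'c': [3.0, np.nan]})
trimmed = df.dropna(axis=1, how='all')  # drop columns that are ALL NaN
print(list(trimmed.columns))  # → ['a', 'c']
```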

```python
loans_df.head()        # 5 rows x 55 columns
loans_df.describe()    # 8 rows x 54 columns

# Split into train and test sets
loans_df = loans_df.fillna(0)
train_loan_df, test_loan_df = train_test_split(loans_df, test_size=0.2)
y_loan_train = train_loan_df[target_col]
X_loan_train = train_loan_df.drop(target_col, axis=1)
y_loan_test = test_loan_df[target_col]
X_loan_test = test_loan_df.drop(target_col, axis=1)

# For a letter-digit pair from A1 to G5, return its order in the list,
# e.g. A1 = 1, A2 = 2, B1 = 6, G5 = 35
def grade_to_int(grade):
    if len(grade) != 2:
        return 0
    let = ord(grade[0]) - ord('A')
    dig = int(grade[1])
    return let * 5 + dig

def num_to_grade(num):
    let = chr(int(np.floor(num / 5)) + ord('A'))
    dig = int(np.round(num)) % 5
    if dig == 0:  # grades are 1-indexed
        let = chr(ord(let) - 1)
        dig = 5
    return let + str(dig)

OLSModel = OLS(y_loan_train.map(grade_to_int),
               sm.add_constant(X_loan_train)).fit()
r2_score(OLSModel.predict(sm.add_constant(X_loan_test)),
         y_loan_test.map(grade_to_int))
```

    0.3303237488371052

The OLS model resulted in an $R^2$ value of 0.3303.
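A quick sanity check of the grade mapping helps here; this hypothetical snippet re-states `grade_to_int` and adds an arithmetic inverse (a simplification of the notebook's `num_to_grade`, restricted to exact integers) to verify the round trip over all 35 sub-grades.

```python
def grade_to_int(grade):
    # 'A1' -> 1, 'B1' -> 6, ..., 'G5' -> 35; anything malformed -> 0
    if len(grade) != 2:
        return 0
    return (ord(grade[0]) - ord('A')) * 5 + int(grade[1])

def int_to_grade(num):
    # Exact inverse for num in 1..35 (hypothetical helper, integers only)
    let = chr((num - 1) // 5 + ord('A'))
    dig = (num - 1) % 5 + 1
    return let + str(dig)

assert grade_to_int('A1') == 1
assert grade_to_int('B1') == 6
assert grade_to_int('G5') == 35
assert all(int_to_grade(grade_to_int(g)) == g
           for g in (l + d for l in 'ABCDEFG' for d in '12345'))
```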

```python
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
ax.scatter(OLSModel.predict(sm.add_constant(X_loan_test)),
           y_loan_test.map(grade_to_int), alpha=0.1)
ax.set_xlabel('$R^2$ Values', fontsize=18)
ax.set_ylabel(target_col, fontsize=18)
y_values = dict((i, num_to_grade(i)) for i in range(1, 36))
ax.set_yticks(list(y_values.keys()))
ax.set_yticklabels(list(y_values.values()))
ax.set_title('OLS Model $R^2$ Values', fontsize=18)

model = LinearRegression()
selector = RFE(model).fit(X_loan_train, y_loan_train.map(grade_to_int))
r2_score(selector.predict(X_loan_test), y_loan_test.map(grade_to_int))
```

    -0.48433446033411376

That's worse than before. Let's try again, removing only a quarter of the features, to see if the $R^2$ improves.

```python
selector = RFE(model, 77).fit(X_loan_train, y_loan_train.map(grade_to_int))
r2_score(selector.predict(X_loan_test), y_loan_test.map(grade_to_int))
```

    0.34345254884079535

Much better!
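The RFE pattern used above can be demonstrated on synthetic data: with two informative features out of six, recursive feature elimination wrapped around a linear model should keep exactly those two. This is a hypothetical sketch, not the notebook's data.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 6)
# Only features 0 and 1 drive the target; the rest are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)

selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(selector.support_)  # boolean mask of the kept features
```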

With more trials we could probably find the optimal number of features to remove.

```python
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
ax.scatter(OLSModel.predict(sm.add_constant(X_loan_test)),
           y_loan_test.map(grade_to_int), alpha=0.1)
ax.set_xlabel('$R^2$ Values', fontsize=18)
ax.set_ylabel(target_col, fontsize=18)
y_values = dict((i, num_to_grade(i)) for i in range(1, 36))
ax.set_yticks(list(y_values.keys()))
ax.set_yticklabels(list(y_values.values()))
ax.set_title('OLS RFE Model $R^2$ Values', fontsize=18)
```

Let's try logistic regression on our dataset. Future plans include cross-validation with Ridge/Lasso regularization to prevent overfitting. Due to limited computational resources, the following executions can take a while to process.

```python
model = LogisticRegression()
model.fit(X_loan_train, y_loan_train.map(grade_to_int))
r2_score(model.predict(X_loan_test), y_loan_test.map(grade_to_int))

plt.scatter(model.predict(X_loan_test), y_loan_test.map(grade_to_int), alpha=0.1)
plt.show()

model = LogisticRegression()
selector = RFE(model).fit(X_loan_train, y_loan_train.map(grade_to_int))
r2_score(selector.predict(X_loan_test), y_loan_test.map(grade_to_int))

selector = RFE(model, 77, step=5).fit(X_loan_train, y_loan_train.map(grade_to_int))
r2_score(selector.predict(X_loan_test), y_loan_test.map(grade_to_int))
plt.scatter(selector.predict(X_loan_test), y_loan_test.map(grade_to_int), alpha=0.1)
```

Let's try feature engineering to improve the model. We will import the average adjusted gross income by the applicant's zip code and determine whether the demographic area is significant. This dataset is publicly available at www.irs.gov/statistics.

Since the LendingClub dataset only contains the first 3 digits of the applicant's zip code, we will use the average adjusted gross income across all zip codes sharing those first 3 digits. Future plans include cross-validation with Ridge/Lasso regularization to prevent overfitting. Time for some data loading and cleaning to see the results!
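The 3-digit averaging and left join described above can be sketched on hypothetical toy data (made-up zip codes and incomes, not the IRS file): collapse full zips onto their 3-digit prefix, average the income, and merge onto the loans.

```python
import pandas as pd

# Hypothetical IRS-style rows: '02138' and '02139' share prefix '021'
agi = pd.DataFrame({'ZIPCODE': ['02138', '02139', '90210'],
                    'adj_gross_income': [80000.0, 60000.0, 150000.0]})
agi['zip_code'] = agi['ZIPCODE'].str[:3].astype(int)
agi3 = agi.groupby('zip_code', as_index=False)['adj_gross_income'].mean()

# Hypothetical loans keyed by the same 3-digit prefix
loans = pd.DataFrame({'zip_code': [21, 902]})
merged = pd.merge(loans, agi3, how='left', on='zip_code')
print(merged['adj_gross_income'].tolist())  # → [70000.0, 150000.0]
```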

```python
# Add `zip_code` back from `full_loan_stats`, since it was removed earlier
loans_df['zip_code'] = full_loan_stats['zip_code'].str[:3].astype(np.int64)

# In the IRS file, N1 is the number of returns and A00100 the total
# adjusted gross income (in thousands of dollars)
full_agi = pd.read_csv('15zpallnoagi.csv')
agi_df = full_agi[['ZIPCODE', 'N1', 'A00100']].copy()
agi_df['adj_gross_income'] = round((agi_df['A00100'] / agi_df['N1']) * 1000, 2)
agi_df['zip_code'] = agi_df['ZIPCODE'].astype(str).str[:3].astype(np.int64)

# Average the adjusted gross income over the first three digits of the zip code
agi_df = agi_df.groupby(['zip_code'], as_index=False)['adj_gross_income'].mean()
agi_df = agi_df.round({'adj_gross_income': 2})

# Use a left join to attach `agi_df` onto `loans_df`
loans_df = pd.merge(loans_df, agi_df, how='left', on=['zip_code'])
loans_df = loans_df.fillna(0)

train_loan_df, test_loan_df = train_test_split(loans_df, test_size=0.2)
y_loan_train = train_loan_df[target_col]
X_loan_train = train_loan_df.drop(target_col, axis=1)
y_loan_test = test_loan_df[target_col]
X_loan_test = test_loan_df.drop(target_col, axis=1)

OLSModel = OLS(y_loan_train.map(grade_to_int),
               sm.add_constant(X_loan_train)).fit()
r2_score(OLSModel.predict(sm.add_constant(X_loan_test)),
         y_loan_test.map(grade_to_int))
```

    0.34136288693706146

The OLS model with adjusted gross income resulted in an $R^2$ value of 0.3414.

The OLS model without income resulted in an $R^2$ value of 0.3303, so the adjusted gross income did improve our model, though not significantly.

```python
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
ax.scatter(OLSModel.predict(sm.add_constant(X_loan_test)),
           y_loan_test.map(grade_to_int), alpha=0.1)
ax.set_xlabel('$R^2$ Values', fontsize=18)
ax.set_ylabel(target_col, fontsize=18)
y_values = dict((i, num_to_grade(i)) for i in range(1, 36))
ax.set_yticks(list(y_values.keys()))
ax.set_yticklabels(list(y_values.values()))
ax.set_title('OLS Adjusted Gross Income $R^2$ Values', fontsize=18)

model = LinearRegression()
selector = RFE(model, 55).fit(X_loan_train, y_loan_train.map(grade_to_int))
r2_score(selector.predict(X_loan_test), y_loan_test.map(grade_to_int))
```

    0.32201252789623513

Using RFE lowers our $R^2$ value, but not significantly.

```python
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
ax.scatter(OLSModel.predict(sm.add_constant(X_loan_test)),
           y_loan_test.map(grade_to_int), alpha=0.1)
ax.set_xlabel('$R^2$ Values', fontsize=18)
ax.set_ylabel(target_col, fontsize=18)
y_values = dict((i, num_to_grade(i)) for i in range(1, 36))
ax.set_yticks(list(y_values.keys()))
ax.set_yticklabels(list(y_values.values()))
ax.set_title('OLS RFE Adjusted Gross Income $R^2$ Values', fontsize=18)
```

LC strongly bases its grading system on FICO score, loan amount and loan term, which does not make it too different from the traditional banking system.

We would like to build a model based on more complex predictors that can give an opportunity to a wider segment of the population by identifying "good borrowers" who may not have a perfect FICO score but will not default.
