Using Ensemble Methods to Predict Loans

Firdaws Ele-Ojo Yahya
8 min read · Dec 22, 2020

Recently, I came across the Analytics Vidhya loan prediction hackathon and decided I might as well try it out as my first solo hackathon. I also decided to try my hand at ensemble learning algorithms to better understand them as a concept. Before I get into my code, I should give a brief summary of what ensemble methods are.

Ensemble learning combines the decisions of different models in order to improve overall performance. Put into a real-world context, it is like seeking a second opinion from different sources before coming to a decision yourself. Ensemble learning comes in different forms. The basic forms involve majority voting (multiple models make a prediction and the class with the most votes is picked as the final decision), averaging (the average of the models' predictions is taken) and weighted averaging (each model is assigned a weight that defines its importance before the predictions are combined).
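As a quick illustration, here is a minimal sketch of majority voting with scikit-learn; the three base models are just placeholders, not the algorithms I use later in this post.

```python
# A minimal sketch of majority (hard) voting with scikit-learn.
# The three base models below are illustrative placeholders.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier()),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",  # each model casts a vote; the majority class wins
)
# voter.fit(X_train, y_train)
# voter.predict(X_test)
```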

For the advanced ensemble learning methods, we have stacking (the predictions of different models are used as features to train a new model), blending (similar to stacking, but the new model is trained on predictions made on a hold-out validation set), bagging (several models are trained on bootstrap samples of the data and their results are combined into a final prediction) and boosting (models are built sequentially, each one correcting the errors of the previous one).

In this blog post, I made use of three algorithms based on bagging and boosting. These were XGBoost, AdaBoost and Random Forest.

Bagging is also known as bootstrap aggregating. It decreases variance and overfitting by training several models on bootstrap samples of the data set and combining their outputs through averaging or voting. This trades a slight increase in bias for a reduction in variance. Although it is mostly used with decision trees, it can also be used with other models. Random Forest and the bagging meta-estimator are well-known bagging algorithms.
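As a rough sketch of the idea, this is how the bagging meta-estimator looks in scikit-learn, wrapped around a decision tree (illustrative values, not the exact setup used later in this post).

```python
# Bagging: each tree is trained on a bootstrap sample of the rows,
# and their predictions are combined by majority vote.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagger = BaggingClassifier(
    DecisionTreeClassifier(),  # base model to bag
    n_estimators=100,          # number of bootstrap samples / trees
    bootstrap=True,            # sample rows with replacement
    random_state=42,
)
# bagger.fit(X_train, y_train)
```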

Boosting works sequentially: a model is first built from the training data, then a second model is built to correct the errors of the previous one. This process continues until the training data is predicted well enough or the maximum number of models is reached. There are various boosting algorithms, but the ones I make use of are XGBoost (eXtreme Gradient Boosting) and AdaBoost (Adaptive Boosting).

For this hackathon, we were given a test.csv file, a train file and a submission file (the format in which we were to submit our predicted values).

I normally use Jupyter Notebook for my code, but recently I have found myself using Google Colab more often. For this project, I made use of Google Colab. You can also find my code here.

I do not plan on focusing too much on the methods I used for data wrangling and visualization, but I will walk through some of them. Keep in mind that whatever I did to the training data was also done to the test data, with the exception of the modelling and situations where I had to drop the target attribute.

NULL VALUES

Instead of dropping the rows with null values, I filled them in with the mean, median or mode of each column, depending on the column type and how many values were missing. This is also known as mean/median/mode imputation and is one of the various methods one can use to fill in missing values.
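A sketch of that kind of imputation with pandas, assuming the train.csv from the hackathon and its usual column names (LoanAmount, Gender, Married, Self_Employed, Credit_History):

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Numerical column: fill missing values with the median
train["LoanAmount"] = train["LoanAmount"].fillna(train["LoanAmount"].median())

# Categorical columns: fill missing values with the most frequent value (mode)
for col in ["Gender", "Married", "Self_Employed", "Credit_History"]:
    train[col] = train[col].fillna(train[col].mode()[0])
```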

OUTLIERS

I checked for outliers using the z-score. First the z-scores are calculated and saved into z, and a threshold of 3 is used to fish out the outliers. Taking you back to statistics, recall that the z-score tells us how many standard deviations a value is from the mean. For a normal distribution, roughly 99.7% of values fall within 3 standard deviations of the mean, so anything beyond that is unusual enough to be termed an outlier. This is why you will most likely see people use 3 as their threshold. I go on to drop any row where z > 3.
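A sketch of that z-score filter, continuing with the train frame from the imputation sketch above and using the absolute z-score, which is the common convention:

```python
import numpy as np
from scipy import stats

# z-scores of the numeric columns (nulls were already filled above)
numeric = train.select_dtypes(include=np.number)
z = np.abs(stats.zscore(numeric))

# Keep only rows where every numeric feature is within 3 standard deviations
train = train[(z < 3).all(axis=1)]
```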

VISUALIZATION

Visualizing the categorical columns, we can deduce that males received more loan approvals than females. Married applicants are favored, although among females the gap between married and single applicants is much smaller than among males. Applicants with fewer dependents were more likely to get a loan, and being a graduate seemed more favorable. Under self-employed, there is a large margin between those who got a loan and those who did not, so we can assume that being self-employed has little correlation with loan status. A credit history of 1 is preferred, and those who live in a semiurban area got the most loan approvals compared to the other property areas. I also made use of a box plot for outliers and a heatmap for correlation; you can find these in my code.
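A sketch of the kind of plots described above, assuming seaborn and the Credit_History and Loan_Status columns of this dataset:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Approved vs. rejected loans for each credit history value
sns.countplot(data=train, x="Credit_History", hue="Loan_Status")
plt.show()

# Correlation heatmap of the numeric features
sns.heatmap(train.select_dtypes(include="number").corr(), annot=True, cmap="Blues")
plt.show()
```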

ONE HOT ENCODING

To build a model, the data is expected to be in numerical form, and one way to get there is one-hot encoding. This creates extra features, as each category in a column becomes a column of its own.
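A sketch with pandas' get_dummies, continuing with the train frame from the earlier sketches and assuming the categorical columns of this dataset:

```python
import pandas as pd

# Each category becomes its own 0/1 column
categorical_cols = ["Gender", "Married", "Dependents", "Education",
                    "Self_Employed", "Property_Area"]
train = pd.get_dummies(train, columns=categorical_cols)
```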

Scaling is done on data whose features have different magnitudes. Standardization gives each feature a mean of zero and a standard deviation of one, while min-max scaling (normalization) rescales each feature into the [0, 1] range. Scaling is not only necessary when using features with different units but can also be considered standard practice in machine learning. Although some tree-based methods are indifferent to it, I find it worthwhile to employ. I make use of min-max scaling to scale my data.
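A sketch of min-max scaling with scikit-learn, assuming Loan_Status is the target and Loan_ID is dropped from the features (both names taken from this dataset); the same fitted scaler would then be applied to the test features.

```python
from sklearn.preprocessing import MinMaxScaler

# Separate the features and the target, mapping 'Y'/'N' to 1/0
X = train.drop(columns=["Loan_ID", "Loan_Status"])
y = train["Loan_Status"].map({"Y": 1, "N": 0})

scaler = MinMaxScaler()             # rescales each feature to [0, 1]
X_scaled = scaler.fit_transform(X)  # fit on the training features only
# test_scaled = scaler.transform(test_features)  # reuse on the test file
```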

MODELING

I split my data into training and testing sets with a test size of 0.3, meaning 70% for training and 30% for testing. I then set LOAN_ID to the Loan_ID column of the original test set because it is needed for the submission file.
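A sketch of the split with scikit-learn, continuing from the scaled features above; loan_ids here is a placeholder name for the Loan_ID column kept from the original test file.

```python
from sklearn.model_selection import train_test_split

# test_size=0.3 puts 30% of the rows in the test split, 70% in training
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

# Keep the Loan_IDs from the original test.csv for the submission file later
# loan_ids = original_test["Loan_ID"]
```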

XGBOOST

XGBoost, like its name suggests, is a boosting ensemble algorithm. These are the main hyperparameters I worked with:

learning_rate: prevents overfitting; its range is 0 to 1. Makes the model more robust by shrinking the weights at each step.

max_depth: determines the depth of the tree built in each round.

gamma: a form of regularization that penalizes models when they become too complex. Gamma controls whether a given node will split, based on the expected reduction in loss after the split; a higher value leads to fewer splits. Supported only for tree-based learners.

colsample_bytree: the fraction of features (columns) sampled when constructing each tree.

reg_lambda: the L2 regularization term on the weights.

n_estimators: the number of trees you want to build/fit.
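A minimal sketch of how those parameters fit together, using the xgboost package and illustrative values rather than the exact ones I settled on:

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=100,      # number of boosted trees
    learning_rate=0.1,     # shrinks each tree's contribution
    max_depth=4,           # depth of each tree
    gamma=1,               # minimum loss reduction required to split a node
    colsample_bytree=0.8,  # fraction of columns sampled per tree
    reg_lambda=1,          # L2 regularization on the leaf weights
)
xgb.fit(X_train, y_train)
print("training score:", xgb.score(X_train, y_train))
print("testing score:", xgb.score(X_test, y_test))
```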

The MAE (mean absolute error) is the average magnitude of the error in a prediction. It is used mainly for regression problems, so I probably should not have leaned on it here. Either way, when you do use it, aim for a lower MAE; the lower the better.
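For reference, computing it is a one-liner with scikit-learn (again, it fits regression better than this classification task):

```python
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, xgb.predict(X_test))
print("MAE:", mae)
```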

To read more on this, I suggest this post.

For this model, I get a training score of 0.85 and a testing accuracy of 0.8. My model does well at predicting the likelihood of a person getting a loan (1) but not the likelihood of them not getting the loan (0).

RANDOM FOREST

The random forest is tuned and the model gives a training score of 0.83 and a testing score of 0.74. I perform cross validation and get an ROC AUC score of 0.754.
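A sketch of the random forest with a cross-validated ROC AUC, assuming scikit-learn and illustrative hyperparameter values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=42)
rf.fit(X_train, y_train)
print("training score:", rf.score(X_train, y_train))
print("testing score:", rf.score(X_test, y_test))

# Cross-validated ROC AUC on the training folds
roc = cross_val_score(rf, X_train, y_train, cv=5, scoring="roc_auc")
print("mean ROC AUC:", roc.mean())
```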

ADABOOST

This model is affected by outliers and noisy data. Using GridSearchCV, I find the best hyperparameters for the model based on n_estimators and learning_rate. For the base model, I use a decision tree classifier. I then fit my model before performing cross validation on it. I get a training score of and a testing score of
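A sketch of that grid search, assuming scikit-learn's AdaBoostClassifier with a shallow decision tree as the base model and an illustrative parameter grid:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 1.0],
}

ada = GridSearchCV(
    AdaBoostClassifier(DecisionTreeClassifier(max_depth=1)),
    param_grid,
    cv=5,
    scoring="accuracy",
)
ada.fit(X_train, y_train)
print("best params:", ada.best_params_)
print("training score:", ada.score(X_train, y_train))
print("testing score:", ada.score(X_test, y_test))
```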

PREDICTING ON TEST FILE

The submission file consists only of the Loan_ID and the predicted value converted to 'Y' and 'N'. For AdaBoost, I pass the processed test file into ada.predict() and save the output to predict. A data frame submission3 is then created to hold the Loan_ID and Loan_Status. The values '1' and '0' in the prediction are then replaced as required, and the data frame is converted into a CSV file. To view it, the CSV is imported and opened again. This method was used to create the submission files for the other models too.
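A sketch of how that submission file is put together with pandas; test_scaled and loan_ids are the placeholder names carried over from the earlier sketches.

```python
import pandas as pd

# Predict on the processed test features with the fitted AdaBoost model
predict = ada.predict(test_scaled)

submission3 = pd.DataFrame({
    "Loan_ID": loan_ids,      # Loan_IDs from the original test.csv
    "Loan_Status": predict,
})

# Convert 1/0 back to the 'Y'/'N' labels the submission format expects
submission3["Loan_Status"] = submission3["Loan_Status"].replace({1: "Y", 0: "N"})
submission3.to_csv("submission3.csv", index=False)

# Re-open the CSV to check it
print(pd.read_csv("submission3.csv").head())
```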

CONCLUSION

I have a lot more to do in order to improve my model, and there are a few evaluation metrics I wish to explore. I also believe I can find a better method for selecting the hyperparameters to tune. On submitting my files, I got a score of 0.77 for both XGBoost and AdaBoost and 0.69 for random forest. Although my score is not too bad, seeing a lot of better scores makes me want to see how much I can strengthen my model. Many of those who topped the leaderboard started from a low score and kept improving their models to get a better one. For each change I make, I shall try merging it into my file on GitHub while also trying to make my code a bit more detailed. What next? Maybe play around with model deployment based on this loan prediction file. I am excited to blog about that.

Originally published at https://datasciencewithfiddy.wordpress.com on December 22, 2020.
