Learn Data Science and Machine Learning

Gopikrishna Yadam
6 min read · May 23, 2021

Since our target audience is people who already have some idea of programming, I take a reversed approach: starting from the solution to an online data science competition problem, I show which concepts show up in real code, which in turn tells us which topics to study next.

Problem Statement: A company named X deals in home loans. It has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility for the loan.

Problem

The company wants to automate the loan eligibility process (in real time) based on the customer details provided in the online application form: Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this, they pose the problem of identifying the customer segments that are eligible for a loan, so that those customers can be targeted specifically. A partial data set is provided.

Data

Variable — Description

Loan_ID — Unique Loan ID
Gender — Male/Female
Married — Applicant married (Y/N)
Dependents — Number of dependents
Education — Applicant education (Graduate/Under Graduate)
Self_Employed — Self-employed (Y/N)
ApplicantIncome — Applicant income
CoapplicantIncome — Coapplicant income
LoanAmount — Loan amount in thousands
Loan_Amount_Term — Term of loan in months
Credit_History — Credit history meets guidelines
Property_Area — Urban/Semi-urban/Rural
Loan_Status — Loan approved (Y/N)

Solution for the above problem statement:

#CODE 1

The first and foremost step in building a solution is to understand the distribution of the data by slicing and dicing its variables. We should understand how each variable affects the target variable. This area is called DATA ANALYSIS.

So please concentrate on this area. We can use tools such as Tableau, Power BI or Excel; the choice is up to us.

This area drives efficient algorithm selection and helps us improve our solution.

E.g. after doing the analysis, we assumed Variable A was closely related to the target variable. We trained an algorithm with that variable and obtained an accuracy of 70%. But the truth is that if we had included Variable B along with Variable A in training, the accuracy would have risen to 85%. So accurate analysis is essential.

Right now I am concentrating on which areas to learn, which is why I am not walking through the analysis we did for the above problem statement.
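The original code under #CODE 1 is not reproduced here, so the following is only a minimal sketch of this kind of analysis using pandas. The column names come from the data description above, but the sample rows are made up for illustration, not taken from the real competition file.

```python
import pandas as pd

# Tiny illustrative sample -- the real competition CSV is not reproduced here,
# only a few of its columns (names taken from the data description above).
df = pd.DataFrame({
    "Gender":         ["Male", "Female", "Male", "Male", "Female", "Male"],
    "Credit_History": [1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "LoanAmount":     [120.0, 66.0, 95.0, 158.0, 168.0, 110.0],
    "Loan_Status":    ["Y", "Y", "N", "Y", "N", "Y"],
})

# Distribution of the target variable
print(df["Loan_Status"].value_counts())

# How a candidate predictor relates to the target
approval_by_credit = pd.crosstab(df["Credit_History"], df["Loan_Status"])
print(approval_by_credit)

# Summary statistics of a numeric variable
print(df["LoanAmount"].describe())
```

Even on this toy sample, the crosstab hints that applicants with a clean credit history get approved far more often, which is exactly the kind of relationship the analysis stage is meant to surface.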

#CODE 2

And here comes one of the crucial parts of problem solving. Not every data set is clean, so we need to clean it using data munging techniques.

What are Data Munging Techniques?

Let’s consider the data from the above problem statement. The column LoanAmount holds the amount of the loan taken by a person. Suppose this column contains null values: will that cause a problem for our solution? Yes, it will degrade the quality of the output.

So there are multiple ways to handle different types of variables. This area of study is called EXPLORATORY ANALYSIS.

It requires a lot of statistical knowledge to understand each and every concept in this study.
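The original code under #CODE 2 is not shown, so here is one common way to handle the null-value scenario described above, sketched with pandas. The column names match the data description; the values and the choice of median/mode imputation are illustrative assumptions, not the author's exact method.

```python
import numpy as np
import pandas as pd

# Illustrative frame with the kind of gaps the real data set has
# (column names from the data description above).
df = pd.DataFrame({
    "LoanAmount":    [120.0, np.nan, 95.0, 158.0, np.nan],
    "Self_Employed": ["No", "Yes", None, "No", "No"],
})

# Numeric column: fill nulls with the median (robust to outliers)
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())

# Categorical column: fill nulls with the most frequent value (the mode)
df["Self_Employed"] = df["Self_Employed"].fillna(df["Self_Employed"].mode()[0])

print(df.isnull().sum())  # every column should now report zero nulls
```

Median and mode are only two options; depending on the variable, dropping rows or predicting the missing value from other columns may work better.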

#CODE 3

In this section we are learning Label Encoding.

What is this Label Encoding?

It’s a technique to encode labels with values between 0 and n_classes − 1. It can be used both to normalize labels and to transform non-numeric labels into numeric ones.

Let’s take the example of a categorical variable named X with 4 different categories. With this technique, those 4 categories are encoded as 0, 1, 2 and 3 respectively instead of the category names, as shown below.

Variable X (string datatype) → label encoding → Variable X (int/float datatype)

A → 0
B → 1
C → 2
D → 3

Now the challenge is how to do this in R/Python.

Reason behind this process:

In the real world, labels are in the form of words, because words are human readable. We label our training data with words so that the mapping can be tracked. To convert word labels into numbers, we need to use a Label Encoder. This enables the algorithms to operate on our data.

#CODE 4

Here comes the crucial stage of the solution, the one that shows how good our analysis of the given data set really is.

If we look at the code under #CODE 4 for the given problem statement, three algorithms are used:

  1. Logistic Regression
  2. Decision Tree
  3. Random Forest

The concept to learn here is the types of machine learning algorithms and their usage.

Once an algorithm is selected and trained on the data, the next step is to calculate the accuracy score and error rate. For that we have many measures, such as R squared, the confusion matrix, ME, RMSE and MAPE, but we should know which one to use depending on our output. The next step is to cross-validate our model using techniques such as K-fold cross-validation, which checks model accuracy by training and testing on multiple combinations of training and test data. We need to look into cross-validation techniques.

In the above section (#CODE 4), K-Fold technique is used for cross-validation of the model.
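Since the #CODE 4 listing is not shown, here is a sketch of the same workflow with scikit-learn: the three algorithms named above, a held-out accuracy score, and K-fold cross-validation. The synthetic data stands in for the pre-processed, label-encoded loan data; it is not the competition data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned, encoded loan data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_acc = accuracy_score(y_test, model.predict(X_test))
    # 5-fold cross-validation over the full data set
    kfold = KFold(n_splits=5, shuffle=True, random_state=0)
    cv_acc = cross_val_score(model, X, y, cv=kfold).mean()
    scores[name] = (test_acc, cv_acc)
    print(f"{name}: test accuracy {test_acc:.2f}, 5-fold CV accuracy {cv_acc:.2f}")
```

Comparing the single-split accuracy with the K-fold average is exactly the check the next section relies on: a large gap between the two is the warning sign discussed below.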

#CODE 5

Now here is the most interesting and final part of the solution. It is possible to achieve an accuracy score of 100% while the cross-validation score is only 78%. Because we got a 100% accuracy score, does that mean our solution is excellent? Not at all! This is a classic case of “OVERFITTING”.

Now what is this OVERFITTING? I found an intuitive explanation on Quora that I felt was worth sharing; it should give readers a clear idea of OVERFITTING.

There are techniques to handle overfitting in data. But for now, as we are concentrating on which concepts to learn, we should look at the feature importance matrix to find the most important features.

Feature importance tells us how much each variable used in training contributed to the algorithm. Based on that, we can change the variables used in training to fine-tune the output.

This concept is called Feature Engineering.
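The #CODE 5 listing is not reproduced, so here is a sketch of reading feature importances from a trained random forest. The feature names are borrowed from the data description above purely for illustration; the data itself is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the names below stand in for the loan data's columns
feature_names = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount",
                 "Loan_Amount_Term", "Credit_History"]
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Importances sum to 1; higher means the model leaned on that feature more
importances = pd.Series(model.feature_importances_,
                        index=feature_names).sort_values(ascending=False)
print(importances)
```

Retraining with only the top-ranked features (or engineering new ones from them) is the fine-tuning loop the paragraph above describes.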

I hope this explanation will help you to get an idea on the concepts required for data science.
